ABSTRACT

Given a probability space, we will analyze the uncertainty, that is, the amount of information of a finite system, by studying the entropy of the system. We also extend the concept of entropy to a dynamical system by introducing a measure preserving transformation on a probability space. After showing some theorems and applications of entropy theory, we study the concept of ergodicity, which helps us to further analyze the information of the system.
Chapter 1


Shannon Entropy 


1.1 Introduction 


The concept of entropy is used in different fields of study such as thermodynamics, statistical mechanics, and communication theory, to name a few. In thermodynamics, entropy is an indicator of reversibility: when there is no change of entropy, the process is reversible. In statistical mechanics, entropy measures the unpredictability that comes from a lack of knowledge of the positions and velocities of molecules. Communication theory gives a different perspective. Here we consider a message source, such as a writer or speaker. The amount of information conveyed by the message increases as the amount of uncertainty as to what message actually will be produced becomes greater [Pie80]. Thus, in general, we can state that entropy measures the amount of information given by a source, and a way to describe that source is using ergodicity. These two concepts are part of a broader subject called information theory, which will be developed using concepts of probability theory that help us to generalize and understand it mathematically.


1.2 Properties and Axioms 


Definition 1.2.1. Let $n \in \mathbb{N}$ and $X = \{x_1, \ldots, x_n\}$ be a finite set with probability distribution $p = (p_1, \ldots, p_n)$. That is, $0 \le p_j = p(x_j) \le 1$, and these probabilities also satisfy the condition that $\sum_{j=1}^{n} p_j = 1$. We usually denote this as $(X, p)$ and call it a complete system of events or finite scheme.

The entropy or the Shannon entropy $H(X)$ of a finite scheme $(X, p)$ is defined by

$$H(X) = -\sum_{j=1}^{n} p_j \log p_j.$$

We say that $H(X)$ is the measure of uncertainty or information of the system $(X, p)$.


We will use the convention that $0 \log 0 = 0$, which is easily justified by continuity since $x \log x \to 0$ as $x \to 0$. Also, notice that adding terms of zero probability does not change the entropy. If the base of the logarithm is $b$, we denote the entropy as $H_b(X)$. Common bases are 2 and $e$. If the base of the logarithm is $e$, the entropy is measured in nats, and if the base is 2, the entropy is measured in bits. Furthermore, note that entropy is a functional of the distribution of $X$. Consequently, it does not depend on the actual values taken by the random variable $X$, but only on the probabilities [CTO06].


Example 1.2.1. Bernoulli Entropy Let $X = \{0, 1\}$ be a random variable with probability distribution $p(x_1 = 0) = 1 - p$ and $p(x_2 = 1) = p$. Then its entropy is given by

$$H(X) = -p \log p - (1 - p) \log(1 - p).$$

If we differentiate the entropy function $H_2(X)$ with respect to $p$, we find that

$$H_2'(X) = -\frac{1}{\log_e 2}\left(\log_e p - \log_e(1 - p)\right) = 0 \quad \text{when } p = 1/2.$$

That is, the entropy $H_2(X)$ attains its maximum value of 1 bit at $p = 1/2$.
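As a quick numerical illustration of the Bernoulli example, the following Python sketch evaluates the binary entropy on a grid and locates its maximiser. The function name `binary_entropy` and the grid resolution are illustrative choices, not part of the original text.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) variable, using the convention 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Scan a grid of p values in [0, 1] and locate the maximiser.
grid = [k / 1000 for k in range(1001)]
best_p = max(grid, key=binary_entropy)
print(best_p, binary_entropy(best_p))
```

The grid search is only a sanity check; the derivative argument above is what actually establishes the maximum.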


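Example 1.2.2. Geometric Entropy Assume that we perform independent trials, each succeeding with probability $p$, until the first success occurs. We define the random variable $X$ to be the number of trials required until the first success. Then $X$ is known as a geometric random variable with parameter $p$ and probability distribution

$$p(X = n) = (1 - p)^{n-1} p, \quad n = 1, 2, \ldots$$

Then we find the entropy of $X$,

$$\begin{aligned}
H_2(X) &= -\sum_{n=1}^{\infty} (1 - p)^{n-1} p \log\left[(1 - p)^{n-1} p\right] \\
&= -p \log p \sum_{n=1}^{\infty} (1 - p)^{n-1} - p \log(1 - p) \sum_{n=1}^{\infty} (n - 1)(1 - p)^{n-1} \\
&= -\log p - \frac{1 - p}{p} \log(1 - p) \\
&= \frac{-(1 - p)\log(1 - p) - p \log p}{p}.
\end{aligned}$$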
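The geometric entropy can be checked numerically. The sketch below compares a truncated version of the defining series with the standard closed form $\big(-(1-p)\log(1-p) - p\log p\big)/p$; the function names and the truncation length are assumptions made for illustration.

```python
import math

def geometric_entropy_series(p: float, terms: int = 2000) -> float:
    """Approximate H_2(X) by truncating -sum_n P(X=n) log2 P(X=n)."""
    total = 0.0
    for n in range(1, terms + 1):
        prob = (1 - p) ** (n - 1) * p
        if prob > 0.0:  # guard against underflow to zero
            total -= prob * math.log2(prob)
    return total

def geometric_entropy_closed(p: float) -> float:
    """Closed form: (-(1-p) log2(1-p) - p log2 p) / p, for 0 < p < 1."""
    q = 1 - p
    return (-q * math.log2(q) - p * math.log2(p)) / p

print(geometric_entropy_series(0.3), geometric_entropy_closed(0.3))
```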
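Example 1.2.3. Poisson Entropy A random variable $X$ taking values in $\{0, 1, 2, \ldots\}$ is said to be Poisson with parameter $\lambda$ if for some $\lambda > 0$,

$$p(X = k) = e^{-\lambda} \frac{\lambda^k}{k!}, \quad k = 0, 1, 2, \ldots$$

If we calculate the entropy over all the possible values of a Poisson random variable, using the natural logarithm, then we have

$$\begin{aligned}
H(X) &= -\sum_{k=0}^{\infty} e^{-\lambda} \frac{\lambda^k}{k!} \left(-\lambda + k \log \lambda - \log k!\right) \\
&= \lambda(1 - \log \lambda) + e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k \log k!}{k!}.
\end{aligned}$$

Thus, the entropy of a random variable with Poisson distribution is given by

$$H(X) = \lambda(1 - \log \lambda) + e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k \log k!}{k!}.$$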
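To see that the Poisson formula agrees with the definition of entropy, one can compare the two numerically. This is a minimal sketch in nats (natural logarithm), assuming the series is truncated at a hypothetical cutoff `terms`; the function names are illustrative.

```python
import math

def poisson_log_pmf(k: int, lam: float) -> float:
    """log P(X = k) for a Poisson(lam) variable, computed in log space."""
    return -lam + k * math.log(lam) - math.lgamma(k + 1)

def poisson_entropy_direct(lam: float, terms: int = 200) -> float:
    """Truncated -sum_k p_k log p_k (nats)."""
    return -sum(math.exp(poisson_log_pmf(k, lam)) * poisson_log_pmf(k, lam)
                for k in range(terms))

def poisson_entropy_formula(lam: float, terms: int = 200) -> float:
    """lam (1 - log lam) + e^{-lam} sum_k lam^k log(k!) / k!  (nats)."""
    series = sum(math.exp(poisson_log_pmf(k, lam)) * math.lgamma(k + 1)
                 for k in range(terms))
    return lam * (1 - math.log(lam)) + series

print(poisson_entropy_direct(2.5), poisson_entropy_formula(2.5))
```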


Because the entropy is calculated using the probabilities of each value of the random variable, we can easily show that $H(X) \ge 0$. We get this property since $0 \le p_j \le 1$, which implies that $-\log p_j \ge 0$. Given this fact, the minimum value of $H(X)$ is 0, and this minimum is achieved exactly when every $p_j$ is 0 or 1, that is, when one outcome has probability 1. Can we also talk about a maximum value of $H(X)$ given this finite scheme? We first claim that $H(X) \le \log n$ when $X$ takes $n$ distinct values. Before we prove our claim, we define the set of probability distributions, and then prove a lemma from [Ash67].


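Definition 1.2.2. For $n \in \mathbb{N}$, $\Delta_n$ denotes the set of all probability distributions $p = (p_1, \ldots, p_n)$, i.e.,

$$\Delta_n = \Big\{ p = (p_1, \ldots, p_n) : \sum_{j=1}^{n} p_j = 1,\ p_j \ge 0,\ 1 \le j \le n \Big\}.$$

Lemma 1.2.1. Let $p, q \in \Delta_n$. Then

$$\sum_{j=1}^{n} p_j \log q_j \le \sum_{j=1}^{n} p_j \log p_j,$$

where equality holds if and only if $p_j = q_j$, $1 \le j \le n$.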


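Proof. Because $\log x$ is a concave function, we know that $\log x \le x - 1$ for $x > 0$ (for the natural logarithm; other bases differ only by a positive constant factor). Then using this inequality, we have

$$\log \frac{q_j}{p_j} \le \frac{q_j}{p_j} - 1, \quad \text{so} \quad p_j \log \frac{q_j}{p_j} \le q_j - p_j.$$

Given that $\sum_{j=1}^{n} p_j = \sum_{j=1}^{n} q_j = 1$, we obtain

$$\sum_{j=1}^{n} p_j \log \frac{q_j}{p_j} \le \sum_{j=1}^{n} q_j - \sum_{j=1}^{n} p_j = 0.$$

Thus

$$\sum_{j=1}^{n} p_j \log q_j - \sum_{j=1}^{n} p_j \log p_j \le 0.$$

And, the equality follows from the fact that $\log x = x - 1$ if and only if $x = 1$.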


Theorem 1.2.1. Let $X = \{x_1, \ldots, x_n\}$ be a random variable with probability distribution $p = (p_1, \ldots, p_n)$. Then

$$H(X) \le \log n,$$

where the maximum value is attained if we have equally likely events, that is, $p_i = \frac{1}{n}$, $1 \le i \le n$.


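Proof. Applying the previous lemma with $q_j = \frac{1}{n}$, we have

$$\begin{aligned}
H(X) - \log n &= -\sum_{j=1}^{n} p_j \log p_j - \sum_{j=1}^{n} p_j \log n \\
&= -\sum_{j=1}^{n} p_j \log p_j + \sum_{j=1}^{n} p_j \log \frac{1}{n} \\
&\le 0,
\end{aligned}$$

which shows that $H(X) \le \log n$. Note that if the random variable $X$ has probabilities $p_j = \frac{1}{n}$ for $j = 1, \ldots, n$, then

$$H(X) = H\Big(\frac{1}{n}, \ldots, \frac{1}{n}\Big) = -\sum_{j=1}^{n} \frac{1}{n} \log \frac{1}{n} = \log n,$$

which is the maximum value for $H(X)$.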
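The bound $H(X) \le \log n$ can be illustrated by sampling. The following sketch draws random distributions on $n = 5$ points and compares their entropies with that of the uniform distribution; the sampling scheme (normalised uniform weights, fixed seed) is an arbitrary choice for illustration.

```python
import math
import random

def entropy(p, base: float = 2.0) -> float:
    """Shannon entropy of a finite distribution, with 0 log 0 = 0."""
    return -sum(x * math.log(x, base) for x in p if x > 0)

def random_distribution(n: int, rng: random.Random):
    """A random point of the simplex, obtained by normalising positive weights."""
    w = [rng.random() + 1e-12 for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(0)
n = 5
samples = [random_distribution(n, rng) for _ in range(1000)]
max_seen = max(entropy(p) for p in samples)
uniform_entropy = entropy([1 / n] * n)
print(max_seen, uniform_entropy, math.log2(n))
```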


Based on the theorem, we can answer with certainty that $H(X)$ has a maximum value, namely $\log n$, given a finite scheme of $n$ outcomes.

Since entropy measures the amount of uncertainty, it is important to define the entropy involving two random variables. If we let $Y = \{y_1, \ldots, y_m\}$ be another finite set, then we define the following:


Definition 1.2.3. Let $X$ and $Y$ be two random variables. The pair $(X, Y)$ with joint distribution $p(x, y)$ has a joint entropy defined as

$$H(X, Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log p(x, y).$$

Definition 1.2.4. The conditional entropy $H(X|Y)$ of $X$ given $Y$ is defined by

$$H(X|Y) = -\sum_{y \in Y} \sum_{x \in X} p(y) p(x|y) \log p(x|y),$$

and we define the conditional entropy of $X$ given an observed value $Y = y$ by

$$H(X|y) = -\sum_{x \in X} p(x|y) \log p(x|y).$$

Definition 1.2.5. For two random variables $X$ and $Y$ with a joint distribution $p(x, y)$, the mutual information $I(X, Y)$ between them is defined by

$$I(X, Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}.$$


Notice that if $X$ and $Y$ are independent random variables, then the mutual information between them is $I(X, Y) = 0$. Now, if we want to find the amount of mutual information between the random variable $X$ and itself, we see that $I(X, X) = H(X)$, i.e., the self-mutual information is the entropy of $X$. Using the above definition, we can easily prove the following:

As we mentioned before, the measure of dependence between $X$ and $Y$ is relevant to the computation of the mutual information $I(X, Y)$ as well as the entropy. As another example, suppose that we know that $Y$ gives all the information about $X$. Then the entropy of $X$ given $Y$ is zero, $H(X|Y) = 0$, so no uncertainty about $X$ remains once $Y$ is known. The next theorem gives some important inequalities and illustrates the role of independence in arriving at equality.


Theorem 1.2.2. Let $p, q \in \Delta_n$ be the probability distributions of $X$ and $Y$ respectively. Then

i. $H(X, Y) \le H(X) + H(Y)$
ii. $H(X|Y) \le H(X)$

In both cases, equality holds true if and only if the random variables $X$ and $Y$ are independent.


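Proof. (i) We have that

$$\begin{aligned}
H(X) + H(Y) &= -\Big( \sum_{x} p(x) \log p(x) + \sum_{y} p(y) \log p(y) \Big) \\
&= -\Big( \sum_{x} \sum_{y} p(x, y) \log p(x) + \sum_{x} \sum_{y} p(x, y) \log p(y) \Big) \\
&= -\sum_{x} \sum_{y} p(x, y) \log p(x) p(y) \\
&\ge -\sum_{x} \sum_{y} p(x, y) \log p(x, y), \quad \text{by Lemma 1.2.1,} \\
&= H(X, Y).
\end{aligned}$$

The equality holds if and only if $p(x, y) = p(x) p(y)$ for all $x, y$, that is, if and only if $X$ and $Y$ are independent.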


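(ii) First we claim that the compound (joint) entropy can be written as $H(X, Y) = H(Y) + H(X|Y)$. By definition we know that

$$\begin{aligned}
H(X|Y) &= -\sum_{y \in Y} \sum_{x \in X} p(y) p(x|y) \log p(x|y) \\
&= -\sum_{y \in Y} \sum_{x \in X} p(x, y) \log \frac{p(x, y)}{p(y)} \\
&= -\sum_{y \in Y} \sum_{x \in X} p(x, y) \left( \log p(x, y) - \log p(y) \right) \\
&= -\sum_{y \in Y, x \in X} p(x, y) \log p(x, y) + \sum_{y \in Y, x \in X} p(x, y) \log p(y) \\
&= H(X, Y) - H(Y).
\end{aligned}$$

This shows that the equation above is true. Now suppose that $X$ and $Y$ are independent random variables. Then

$$\begin{aligned}
H(X|Y) &= H(X, Y) - H(Y) \\
&= H(X) + H(Y) - H(Y), \quad \text{by (i),} \\
&= H(X).
\end{aligned}$$

If the random variables $X$ and $Y$ are not independent, then the inequality in (i) is strict, so

$$H(X|Y) = H(X, Y) - H(Y) < H(X) + H(Y) - H(Y) = H(X).$$

Thus, we also proved the inequality in (ii).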
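The quantities in Definitions 1.2.3 to 1.2.5 and the inequalities of Theorem 1.2.2 can be checked on a small joint distribution. The table `joint` below is a made-up example, and the helper names are illustrative.

```python
import math

# Hypothetical joint distribution p(x, y) on {a, b} x {0, 1}.
joint = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.40, ("b", 1): 0.10}

def H(dist) -> float:
    """Entropy in bits of a distribution given as a dict of probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, axis: int):
    """Marginal distribution of the coordinate with index `axis`."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

px, py = marginal(joint, 0), marginal(joint, 1)
h_xy, h_x, h_y = H(joint), H(px), H(py)

# H(X|Y) from the definition, using p(x|y) = p(x, y) / p(y).
h_x_given_y = -sum(p * math.log2(p / py[y]) for (x, y), p in joint.items() if p > 0)

# Mutual information I(X, Y) from Definition 1.2.5.
mutual = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)

print(h_xy, h_x + h_y, h_x_given_y, h_x, mutual)
```

Because `joint` is not a product distribution, both inequalities are strict here and the mutual information is positive.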


1.3 Deriving The Entropy Function 


After stating and proving several properties of the entropy function $H(X)$, we want to show that the definition of this function makes sense and that it is well-defined. In order to do this, we list three more properties that uniquely define the entropy function [Rom92].


i. $H(p_1, \ldots, p_n)$ is defined and continuous for all $p_1, \ldots, p_n$, where $0 \le p_i \le 1$ for $i = 1, \ldots, n$ and $\sum_{i=1}^{n} p_i = 1$. We want this function to be continuous so that a small change in probabilities will result in a small change in uncertainty.

ii. $H\big(\frac{1}{n}, \ldots, \frac{1}{n}\big) < H\big(\frac{1}{n+1}, \ldots, \frac{1}{n+1}\big)$. This property tells us that the uncertainty increases as the number of equally likely outcomes increases. In fact, the entropy of equally likely outcomes is a monotonically increasing function of $n$.

iii. For $c_i \in \mathbb{N}$ and $\sum_{i=1}^{k} c_i = n$, we have

$$H\Big(\frac{1}{n}, \ldots, \frac{1}{n}\Big) = H\Big(\frac{c_1}{n}, \ldots, \frac{c_k}{n}\Big) + \sum_{i=1}^{k} \frac{c_i}{n} H\Big(\frac{1}{c_i}, \ldots, \frac{1}{c_i}\Big).$$
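To construct this equation, let the set $X = \{x_1, \ldots, x_n\}$ be partitioned into nonempty disjoint subsets $C_1, \ldots, C_k$. Let the size of each subset be $|C_i| = c_i$ for $i = 1, \ldots, k$, so that $\sum_{i=1}^{k} c_i = n$. Now, let us choose a subset $C_i$ with probability proportional to its size. That is to say, $P(C_i) = \frac{c_i}{n}$. After that, we choose an element from the subset $C_i$ with equal probability. If the element $x_j$ is in the subset $C_u$, then because

$$P(x_j | C_i) = \begin{cases} 0 & \text{if } i \ne u, \\ \dfrac{1}{c_u} & \text{if } i = u, \end{cases}$$

we have

$$P(x_j) = \sum_{i=1}^{k} P(x_j | C_i) P(C_i) = \frac{1}{c_u} \cdot \frac{c_u}{n} = \frac{1}{n}.$$

This shows that if we choose $x_j$ this way, the probability will be the same as if we were to choose directly from the whole set $X$ with equal probability. Consequently, the uncertainty of the outcomes remains the same.

If we choose directly from $X$ with equal probability, the uncertainty will be

$$H\Big(\frac{1}{n}, \ldots, \frac{1}{n}\Big).$$

But now if we choose one of the subsets $C_1, \ldots, C_k$, the uncertainty is

$$H\Big(\frac{c_1}{n}, \ldots, \frac{c_k}{n}\Big).$$

Now, once we have chosen the subset, we still have the uncertainty of choosing an element from that subset. Then the average uncertainty in choosing an element is

$$\sum_{i=1}^{k} P(C_i) H\Big(\frac{1}{c_i}, \ldots, \frac{1}{c_i}\Big) = \sum_{i=1}^{k} \frac{c_i}{n} H\Big(\frac{1}{c_i}, \ldots, \frac{1}{c_i}\Big).$$

Since the total uncertainty is the same either way, we obtain

$$H\Big(\frac{1}{n}, \ldots, \frac{1}{n}\Big) = H\Big(\frac{c_1}{n}, \ldots, \frac{c_k}{n}\Big) + \sum_{i=1}^{k} \frac{c_i}{n} H\Big(\frac{1}{c_i}, \ldots, \frac{1}{c_i}\Big).$$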
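The grouping identity in property (iii) can be verified numerically for the Shannon entropy. In this sketch the block sizes `sizes = [2, 3, 5]` are an arbitrary example and the helper names are illustrative.

```python
import math

def entropy(p) -> float:
    """Shannon entropy in bits, with 0 log 0 = 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def uniform(n: int):
    """The uniform distribution (1/n, ..., 1/n)."""
    return [1 / n] * n

def grouping_rhs(sizes) -> float:
    """H(c_1/n, ..., c_k/n) + sum_i (c_i/n) H(1/c_i, ..., 1/c_i)."""
    n = sum(sizes)
    coarse = entropy([c / n for c in sizes])
    within = sum((c / n) * entropy(uniform(c)) for c in sizes)
    return coarse + within

sizes = [2, 3, 5]                   # |C_1|, |C_2|, |C_3| with n = 10
lhs = entropy(uniform(sum(sizes)))  # H(1/n, ..., 1/n)
rhs = grouping_rhs(sizes)
print(lhs, rhs)
```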
Theorem 1.3.1. A function $H$ satisfies properties (i)-(iii) if and only if it is of the form

$$H_b(p_1, \ldots, p_n) = -\sum_{i=1}^{n} p_i \log_b p_i,$$

where $b > 1$ is the base and $p \log p = 0$ for $p = 0$.


Proof. Suppose that a function $H$ satisfies all three properties mentioned above. Now, pick some positive integers $m$ and $n$ such that $m$ divides $n$ and $c_i = m$ for all $i = 1, \ldots, k$. Because $mk = \sum_{i=1}^{k} c_i = n$, we get $k = \frac{n}{m}$ and using property (iii) gives

$$H\Big(\frac{1}{n}, \ldots, \frac{1}{n}\Big) = H\Big(\frac{1}{k}, \ldots, \frac{1}{k}\Big) + H\Big(\frac{1}{m}, \ldots, \frac{1}{m}\Big).$$

Define the function $f(n) = H\big(\frac{1}{n}, \ldots, \frac{1}{n}\big)$. Now, let $n = m^s$ where $s$ is also a positive integer. Then the above equation becomes

$$f(m^s) = f(m^{s-1}) + f(m),$$

and applying this repeatedly gives $f(m^s) = s f(m)$. And, this is true for all positive integers $m$ and $s$. Because of property (ii), we now get

$$f(m^s) < f(m^{s+1}), \qquad s f(m) < (s + 1) f(m).$$


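It follows that $f(m)$ must be a positive function. Let us fix $m \ge 2$ and choose some positive integers $r$, $t$ and $s$ so that

$$m^s \le r^t < m^{s+1}.$$

Then because $f$ is a monotonically increasing function,

$$f(m^s) \le f(r^t) < f(m^{s+1}), \quad \text{that is,} \quad \frac{s}{t} \le \frac{f(r)}{f(m)} < \frac{s + 1}{t}.$$

Also, we have

$$s \log m \le t \log r < (s + 1) \log m, \quad \text{that is,} \quad \frac{s}{t} \le \frac{\log r}{\log m} < \frac{s + 1}{t}.$$

From the last two inequalities, we will get

$$-\frac{1}{t} < \frac{f(r)}{f(m)} - \frac{\log r}{\log m} < \frac{1}{t}.$$

Now since $t$ was arbitrarily chosen, we must have

$$\frac{f(r)}{f(m)} = \frac{\log r}{\log m}, \quad \text{i.e.,} \quad \frac{f(r)}{\log r} = \frac{f(m)}{\log m}.$$

Since this is true for all positive integers $r$, we have

$$f(r) = C \log r \quad \text{for some } C > 0,$$

since we also know that $f(r) > 0$. Now suppose that $C = 1$ by choosing the base $b$ of the logarithm appropriately. Then

$$f(r) = \log_b r \quad \text{for all positive integers } r.$$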


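By property (iii),

$$\begin{aligned}
H\Big(\frac{c_1}{n}, \ldots, \frac{c_k}{n}\Big) &= f(n) - \sum_{i=1}^{k} \frac{c_i}{n} f(c_i) \\
&= \log_b n - \sum_{i=1}^{k} \frac{c_i}{n} \log_b c_i \\
&= \sum_{i=1}^{k} \frac{c_i}{n} \log_b n - \sum_{i=1}^{k} \frac{c_i}{n} \log_b c_i \\
&= -\sum_{i=1}^{k} \frac{c_i}{n} \left( \log_b c_i - \log_b n \right) \\
&= -\sum_{i=1}^{k} \frac{c_i}{n} \log_b \frac{c_i}{n}.
\end{aligned}$$

Since any rational $p_1, \ldots, p_k \in (0, 1)$ with $\sum_{i=1}^{k} p_i = 1$ can be expressed in the form $\frac{c_1}{n}, \ldots, \frac{c_k}{n}$, we have

$$H_b(p_1, \ldots, p_k) = -\sum_{i=1}^{k} p_i \log_b p_i.$$

But we also know that $H$ is a continuous function so this must also hold for all positive real numbers $p_1, \ldots, p_k$. Next, we will show that $p \log p = 0$ if $p = 0$. For simplicity, let $\log$ be the natural logarithm and notice that

$$\lim_{p \to 0^+} p \log p = \lim_{p \to 0^+} \frac{\log p}{1/p} = \lim_{p \to 0^+} \frac{1/p}{-1/p^2} = \lim_{p \to 0^+} (-p) = 0.$$

Therefore,

$$H_b(p_1, \ldots, p_k) = -\sum_{i=1}^{k} p_i \log_b p_i$$

holds for all nonnegative real numbers $p_1, \ldots, p_k$ where $0 \le p_i \le 1$ for $i = 1, \ldots, k$ and $\sum_{i=1}^{k} p_i = 1$. For the converse, it is straightforward to show that the entropy function $H$ satisfies the three properties mentioned above.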


1.4 Additional Properties of Entropy and Coding Theory 


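To finish this section, we present an inequality involving binomial coefficients, which plays an important role not only in information theory but also in coding theory. In fact, this inequality is very useful in proving the Noisy Coding Theorem [Rom92].

Lemma 1.4.1. Define the entropy function

$$H_q(\lambda) = \lambda \log_q \frac{1}{\lambda} + \mu \log_q \frac{1}{\mu}$$

for $0 < \lambda < 1$ and $\mu = 1 - \lambda$. Then for any integer $q \ge 2$, we have

$$q^{H_q(\lambda)} = \lambda^{-\lambda} \mu^{-\mu}.$$

Proof. If $q \ge 2$, then

$$\begin{aligned}
q^{H_q(\lambda)} &= q^{\lambda \log_q \frac{1}{\lambda} + \mu \log_q \frac{1}{\mu}} \\
&= q^{\lambda \log_q \frac{1}{\lambda}} \, q^{\mu \log_q \frac{1}{\mu}} \\
&= q^{\log_q \lambda^{-\lambda}} \, q^{\log_q \mu^{-\mu}} \\
&= \lambda^{-\lambda} \mu^{-\mu}.
\end{aligned}$$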


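Theorem 1.4.1. Let $H(\lambda) = \lambda \log \frac{1}{\lambda} + (1 - \lambda) \log \frac{1}{1 - \lambda}$, where $0 \le \lambda \le \frac{1}{2}$ and the logarithm is to base 2. Then

$$\sum_{k=0}^{\lfloor \lambda n \rfloor} \binom{n}{k} \le 2^{n H(\lambda)},$$

where $\binom{n}{k}$ is the binomial coefficient and the upper limit $\lfloor \lambda n \rfloor$ of the summation is the largest integer less than or equal to $\lambda n$.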


Proof. We first observe that the inequality holds trivially at the endpoints of the range of $\lambda$. Specifically, if $\lambda = 0$, then $H(0) = 0$ and both sides of the inequality equal 1. Now, if $\lambda = 1/2$, we have that $H(1/2) = 1$ and the inequality becomes $\sum_{k=0}^{\lfloor n/2 \rfloor} \binom{n}{k} \le 2^n$, which is true for $n \ge 1$. Now, suppose that $0 < \lambda < 1/2$.

From Markov's inequality, we know that if $X$ is a random variable that takes only nonnegative values, then for any value $a > 0$

$$P(X \ge a) \le \frac{E(X)}{a}.$$

Let us assume that $X$ has the form $X = e^{tY}$, where $Y$ is a random variable, and $t$ is a real number. If we set $a = e^{tb}$, then it becomes

$$P(e^{tY} \ge e^{tb}) \le \frac{E(e^{tY})}{e^{tb}} \quad \text{for all } b \in \mathbb{R}.$$

Now, if $t < 0$, then we have $e^{tY} \ge e^{tb}$ if and only if $tY \ge tb$ if and only if $Y \le b$, and so this is equivalent to

$$P(Y \le b) \le \frac{E(e^{tY})}{e^{tb}} \quad \text{for all } b \in \mathbb{R} \text{ and } t < 0.$$


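If $Y$ is a binomial random variable with parameters $(n, p)$, then

$$P(Y \le b) = \sum_{k=0}^{\lfloor b \rfloor} \binom{n}{k} p^k q^{n-k},$$

where $q = 1 - p$. Furthermore, $E(e^{tY})$ is the binomial moment generating function, which is well known to be

$$E(e^{tY}) = (q + p e^t)^n.$$

Thus,

$$\sum_{k=0}^{\lfloor b \rfloor} \binom{n}{k} p^k q^{n-k} \le e^{-tb} (q + p e^t)^n.$$

Setting $b = \lambda n$, where $0 < \lambda < 1$, we get

$$\sum_{k=0}^{\lfloor \lambda n \rfloor} \binom{n}{k} p^k q^{n-k} \le e^{-\lambda n t} (q + p e^t)^n \tag{1.1}$$

valid for $t < 0$. Let $x = e^t$ and $f(x) = x^{-\lambda n} (q + p x)^n$. Since $t < 0$, we will minimize $f$ over $0 < x < 1$. By differentiating $f$ with respect to $x$, we get

$$f'(x) = n x^{-\lambda n - 1} (q + p x)^{n-1} \left[ -\lambda (q + p x) + p x \right].$$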
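Thus, the value of $x$ that will minimize $f$ is

$$x = \frac{\lambda q}{\mu p},$$

where $\mu = 1 - \lambda$, and $\lambda < p$. Substituting this value of $x$ into $f$ gives

$$\begin{aligned}
f\Big(\frac{\lambda q}{\mu p}\Big) &= \Big(\frac{\lambda q}{\mu p}\Big)^{-\lambda n} \Big(q + \frac{\lambda q}{\mu}\Big)^n \\
&= \Big(\frac{\lambda q}{\mu p}\Big)^{-\lambda n} \Big(\frac{q}{\mu}\Big)^n \\
&= \lambda^{-\lambda n} \mu^{-\mu n} p^{\lambda n} q^{\mu n},
\end{aligned}$$

and (1.1) becomes

$$\sum_{k=0}^{\lfloor \lambda n \rfloor} \binom{n}{k} p^k q^{n-k} \le \lambda^{-\lambda n} \mu^{-\mu n} p^{\lambda n} q^{\mu n}$$

for $\lambda < p$. Setting $p = q = \frac{1}{2}$ gives

$$\sum_{k=0}^{\lfloor \lambda n \rfloor} \binom{n}{k} \le \lambda^{-\lambda n} \mu^{-\mu n}$$

for $\lambda < \frac{1}{2}$. From Lemma 1.4.1, we know $\lambda^{-\lambda n} \mu^{-\mu n} = 2^{n[-\lambda \log \lambda - \mu \log \mu]} = 2^{n H(\lambda)}$ and the above inequality becomes

$$\sum_{k=0}^{\lfloor \lambda n \rfloor} \binom{n}{k} \le 2^{n H(\lambda)}.$$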
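The bound of Theorem 1.4.1, $\sum_{k \le \lfloor \lambda n \rfloor} \binom{n}{k} \le 2^{nH(\lambda)}$ for $0 \le \lambda \le 1/2$, can be spot-checked with exact integer arithmetic on the left-hand side. The sketch below uses `fractions.Fraction` for $\lambda$ so that $\lfloor \lambda n \rfloor$ is computed exactly; the range of $n$ and the grid of $\lambda$ values are arbitrary choices.

```python
import math
from fractions import Fraction

def H2(lam) -> float:
    """Binary entropy in bits, with H2(0) = H2(1) = 0."""
    x = float(lam)
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def tail_sum(n: int, lam: Fraction) -> int:
    """Sum of binomial coefficients C(n, k) for k = 0, ..., floor(lam * n)."""
    return sum(math.comb(n, k) for k in range(math.floor(lam * n) + 1))

def bound_holds(n: int, lam: Fraction) -> bool:
    """Check tail_sum(n, lam) <= 2^(n H(lam)) in floating point."""
    return tail_sum(n, lam) <= 2 ** (n * H2(lam))

# Check every lambda in {0, 1/20, ..., 10/20} for n = 1, ..., 40.
all_ok = all(bound_holds(n, Fraction(a, 20)) for n in range(1, 41) for a in range(0, 11))
print(all_ok)
```

A finite check like this does not replace the proof; it only guards against a mis-stated exponent or summation limit.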
Chapter 2


The Kolmogorov-Sinai Entropy 


2.1 Introduction 


In the previous chapter, we developed Shannon's way of measuring the information of a system. This notion of measuring the amount of uncertainty of a source, represented as a random variable along with its distribution, provided us with a probabilistic way of quantifying the amount of uncertainty, and we called this entropy. Now, in this chapter, we extend the concept of entropy to a dynamical system, which is a description of a physical system and its evolution over time. Therefore, we introduce the concept of measure preserving dynamical systems and measure their unpredictability. We will be able to state how unpredictable a dynamical system is depending on its entropy: the higher the unpredictability, the higher the entropy. Furthermore, we will be able to determine how the structures of two dynamical systems relate, i.e., whether or not two dynamical systems are isomorphic. To start, we will define some basic concepts and a probability measure, found in [Shr04], so that we can define a dynamical system.


2.2 The Kolmogorov-Sinai Theorem 


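Definition 2.2.1. Let $X$ be an arbitrary set. A collection $\mathfrak{X}$ of subsets of $X$ is a $\sigma$-algebra of $X$ if

i. $X \in \mathfrak{X}$;

ii. If $A \in \mathfrak{X}$, then $A^c \in \mathfrak{X}$;

iii. If $(A_n : n \in \mathbb{N}) \subseteq \mathfrak{X}$, then $\bigcup_{n \in \mathbb{N}} A_n \in \mathfrak{X}$.

Definition 2.2.2. Let $X$ be an arbitrary set, and $\mathfrak{X}$ be a $\sigma$-algebra of $X$. A function $\mu : \mathfrak{X} \to [0, 1]$ is a probability measure if it satisfies the following properties:

ii. $\mu(X) = 1$;

iii. For every disjoint sequence $(A_n : n \in \mathbb{N})$ in $\mathfrak{X}$, we have

$$\mu\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} \mu(A_i).$$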


Definition 2.2.3. Let $(X, \mathfrak{X}, \mu)$ be a probability measure space, and $S : X \to X$. The transformation $S$ is said to be measurable if $S^{-1}\mathfrak{X} \subseteq \mathfrak{X}$. That is, $S^{-1}A \in \mathfrak{X}$ for every $A \in \mathfrak{X}$. Let $S$ be measurable. Then $S$ is called a measure preserving transformation with respect to $\mu$ if

$$\mu(S^{-1}(A)) = \mu(A) \quad \text{for every } A \in \mathfrak{X}.$$

Definition 2.2.4. Let $(X, \mathfrak{X}, \mu)$ be a probability measure space, and let $S : X \to X$ be a one-to-one measure preserving transformation. If $S^{-1}$ is measurable, that is, $S$ is invertible, then

$$S^{-1}\mathfrak{X} = \mathfrak{X} = S\mathfrak{X}$$

and $S^{-1}$ is also a measure preserving transformation. Now, the space $(X, \mathfrak{X}, \mu, S)$ is called a dynamical system, where $S$ is measure preserving and not necessarily invertible.


Definition 2.2.5. Let $1 \le p < \infty$. We say that the space $L^p(X, \mathfrak{X}, \mu)$ consists of all complex-valued measurable functions $f$ on $X$ that satisfy

$$\int_X |f(x)|^p \, d\mu(x) < \infty.$$

Then, if $f \in L^p(X, \mathfrak{X}, \mu)$, we define the $L^p$ norm of $f$ by

$$\|f\|_p = \Big( \int_X |f(x)|^p \, d\mu(x) \Big)^{1/p}.$$

If we let $p = 1$, the space $L^1(X, \mathfrak{X}, \mu)$ consists of all integrable functions on $X$, and, together with $\| \cdot \|_1$, is a complete normed vector space.

Definition 2.2.6. For our purpose, we denote the $L^1$-space of $(X, \mathfrak{X}, \mu)$ by $L^1(\mathfrak{X})$. If $\mathfrak{Y}$ is a $\sigma$-subalgebra of $\mathfrak{X}$ and $f \in L^1(\mathfrak{X})$, let

$$\mu_f(A) = \int_A f \, d\mu, \quad A \in \mathfrak{Y}.$$

We notice that $\mu_f$ is a countably additive measure on $\mathfrak{Y}$ and is absolutely continuous with respect to $\mu$. That is, if $A \in \mathfrak{Y}$ and $\mu(A) = 0$, then $\mu_f(A) = 0$. By the Radon-Nikodym Theorem, there is a unique $\mathfrak{Y}$-measurable function $g \in L^1(\mathfrak{Y})$ such that

$$\mu_f(A) = \int_A g \, d\mu, \quad A \in \mathfrak{Y}.$$

Here $g$ is unique in the $\mu$-a.e. sense. If we denote $g = E(f|\mathfrak{Y})$, then $g$ is called the conditional expectation of $f$ relative to $\mathfrak{Y}$. If we let $f = 1_A$ be the indicator function of $A \in \mathfrak{X}$, then we denote $E(1_A|\mathfrak{Y}) = P(A|\mathfrak{Y})$, which is the conditional probability of $A$ relative to $\mathfrak{Y}$ [Kak99].


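Definition 2.2.7. Let $\mathfrak{Y}$ be a $\sigma$-subalgebra of $\mathfrak{X}$. A $\mathfrak{Y}$-partition is a finite $\mathfrak{Y}$-measurable partition $\mathfrak{A}$ of $X$. That is,

1. $\mathfrak{A} = \{A_1, \ldots, A_n\} \subseteq \mathfrak{Y}$,
2. $A_j \cap A_k = \emptyset$ if $j \ne k$,
3. $\bigcup_{j=1}^{n} A_j = X$.

Definition 2.2.8. Consider the dynamical system $(X, \mathfrak{X}, \mu, S)$ and let the set of all $\mathfrak{Y}$-partitions be denoted by $\mathcal{P}(\mathfrak{Y})$. If we let $\mathfrak{A}, \mathfrak{B} \in \mathcal{P}(\mathfrak{X})$, then we define the following partitions:

$$\mathfrak{A} \vee \mathfrak{B} = \{A \cap B : A \in \mathfrak{A},\ B \in \mathfrak{B}\}$$

and

$$S^{-1}\mathfrak{A} = \{S^{-1}A : A \in \mathfrak{A}\}.$$

Now, let us define the partition

$$\bigvee_{j=0}^{n-1} S^{-j}\mathfrak{A}.$$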
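Definition 2.2.9. Let $\mathfrak{A} = \{A_1, \ldots, A_n\} \in \mathcal{P}(\mathfrak{X})$. The entropy $H(\mathfrak{A})$ of a partition $\mathfrak{A}$ is defined by

$$H(\mathfrak{A}) = -\sum_{i=1}^{n} \mu(A_i) \log \mu(A_i) = -\sum_{A \in \mathfrak{A}} \mu(A) \log \mu(A).$$

If we let the entropy function $I(\mathfrak{A})$ of $\mathfrak{A}$ be defined by

$$I(\mathfrak{A})(\cdot) = -\sum_{A \in \mathfrak{A}} 1_A(\cdot) \log \mu(A),$$

then we have

$$H(\mathfrak{A}) = E(I(\mathfrak{A})) = \int_X I(\mathfrak{A}) \, d\mu.$$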

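Definition 2.2.10. We define the conditional entropy function $I(\mathfrak{A}|\mathfrak{Y})$ as

$$I(\mathfrak{A}|\mathfrak{Y})(\cdot) = -\sum_{A \in \mathfrak{A}} 1_A(\cdot) \log P(A|\mathfrak{Y})(\cdot).$$

Also, the conditional entropy $H(\mathfrak{A}|\mathfrak{Y})$ is defined by

$$H(\mathfrak{A}|\mathfrak{Y}) = E(I(\mathfrak{A}|\mathfrak{Y})) = \int_X I(\mathfrak{A}|\mathfrak{Y}) \, d\mu. \tag{2.1}$$

Since the entropy $H(\mathfrak{A})$ of a finite partition $\mathfrak{A} \in \mathcal{P}(\mathfrak{X})$ can be expressed as the Shannon entropy, we also express the conditional entropy $H(\mathfrak{A}|\mathfrak{Y})$ as

$$\begin{aligned}
H(\mathfrak{A}|\mathfrak{Y}) &= E(I(\mathfrak{A}|\mathfrak{Y})) \\
&= E\big(E(I(\mathfrak{A}|\mathfrak{Y})|\mathfrak{Y})\big) \\
&= -\sum_{A \in \mathfrak{A}} \int_X P(A|\mathfrak{Y}) \log P(A|\mathfrak{Y}) \, d\mu.
\end{aligned}$$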


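Definition 2.2.11. Let $\mathfrak{A} \in \mathcal{P}(\mathfrak{X})$. Denote $\tilde{\mathfrak{A}} = \sigma(\mathfrak{A})$, which is the $\sigma$-algebra generated by $\mathfrak{A}$. That is, $\tilde{\mathfrak{A}}$ is the smallest $\sigma$-subalgebra of $\mathfrak{X}$ that contains $\mathfrak{A}$. Also, if $\mathfrak{Y}_1, \mathfrak{Y}_2$ are $\sigma$-subalgebras, then let us denote $\mathfrak{Y}_1 \vee \mathfrak{Y}_2 = \sigma(\mathfrak{Y}_1 \cup \mathfrak{Y}_2)$, i.e., the $\sigma$-algebra generated by the union of $\mathfrak{Y}_1$ and $\mathfrak{Y}_2$.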


Notice that $P(A|\sigma(\mathfrak{B})) = \sum_{B \in \mathfrak{B}} \mu(A|B) 1_B$ for $A \in \mathfrak{X}$, where $\mu(A|B)$ is the conditional probability of $A$ given $B$. Then we define

$$H(\mathfrak{A}|\mathfrak{B}) = \sum_{B \in \mathfrak{B}} \mu(B) \sum_{A \in \mathfrak{A}} \{-\mu(A|B) \log \mu(A|B)\}, \tag{2.2}$$

where $-\sum_{A \in \mathfrak{A}} \mu(A|B) \log \mu(A|B)$ is the conditional entropy of $\mathfrak{A}$ given $B \in \mathfrak{B}$ and the above equation is the average conditional entropy of $\mathfrak{A}$ given $\mathfrak{B}$.

We also say that $\mathfrak{A} \le \mathfrak{B}$ means that $\mathfrak{B}$ is finer than $\mathfrak{A}$, that is, each $A \in \mathfrak{A}$ can be expressed as a union of some elements in $\mathfrak{B}$.

Next, we state some fundamental theorems and lemmas so that we can prove the Kolmogorov-Sinai Entropy theorem.
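Theorem 2.2.1. Let $\mathfrak{A}, \mathfrak{B} \in \mathcal{P}(\mathfrak{X})$ and $\mathfrak{Y}, \mathfrak{Y}_1, \mathfrak{Y}_2$ be $\sigma$-subalgebras of $\mathfrak{X}$.
1. $H(\mathfrak{A}|\{\emptyset, X\}) = H(\mathfrak{A})$.
2. $H(\mathfrak{A} \vee \mathfrak{B}|\mathfrak{Y}) = H(\mathfrak{A}|\mathfrak{Y}) + H(\mathfrak{B}|\sigma(\mathfrak{A}) \vee \mathfrak{Y})$.
3. $H(\mathfrak{A} \vee \mathfrak{B}) = H(\mathfrak{A}) + H(\mathfrak{B}|\mathfrak{A})$.
4. $\mathfrak{A} \le \mathfrak{B} \Rightarrow H(\mathfrak{A}|\mathfrak{Y}) \le H(\mathfrak{B}|\mathfrak{Y})$.
5. $\mathfrak{A} \le \mathfrak{B} \Rightarrow H(\mathfrak{A}) \le H(\mathfrak{B})$.
6. $\mathfrak{Y}_2 \subseteq \mathfrak{Y}_1 \Rightarrow H(\mathfrak{A}|\mathfrak{Y}_1) \le H(\mathfrak{A}|\mathfrak{Y}_2)$.
7. $H(\mathfrak{A}|\mathfrak{Y}) \le H(\mathfrak{A})$.
8. $H(\mathfrak{A} \vee \mathfrak{B}|\mathfrak{Y}) \le H(\mathfrak{A}|\mathfrak{Y}) + H(\mathfrak{B}|\mathfrak{Y})$.
9. $H(\mathfrak{A} \vee \mathfrak{B}) \le H(\mathfrak{A}) + H(\mathfrak{B})$.
10. $H(S^{-1}\mathfrak{A}|S^{-1}\mathfrak{Y}) = H(\mathfrak{A}|\mathfrak{Y})$.
11. $H(S^{-1}\mathfrak{A}) = H(\mathfrak{A})$.
12. $I(S^{-1}\mathfrak{A}|S^{-1}\mathfrak{Y}) = I(\mathfrak{A}|\mathfrak{Y}) \circ S$.
13. $I(S^{-1}\mathfrak{A}) = I(\mathfrak{A}) \circ S$.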
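Definition 2.2.12. Let $\mathfrak{A} \in \mathcal{P}(\mathfrak{X})$. The entropy $H(\mathfrak{A}, S)$ of $S$ relative to $\mathfrak{A}$ is defined by

$$H(\mathfrak{A}, S) = \lim_{n \to \infty} \frac{1}{n} H\Big(\bigvee_{j=0}^{n-1} S^{-j}\mathfrak{A}\Big). \tag{2.3}$$

We also define the entropy $H(S)$ of $S$, or the Kolmogorov-Sinai entropy of $S$, by

$$H(S) = \sup\{H(\mathfrak{A}, S) : \mathfrak{A} \in \mathcal{P}(\mathfrak{X})\}.$$

Theorem 2.2.2. Let $\mathfrak{A}, \mathfrak{B} \in \mathcal{P}(\mathfrak{X})$. Then

$$H(\mathfrak{A}, S) \le H(\mathfrak{B}, S) + H(\mathfrak{A}|\mathfrak{B}).$$

Lemma 2.2.1. If $\mathfrak{Y}_n \uparrow \mathfrak{Y}$ and $\mathfrak{A} \in \mathcal{P}(\mathfrak{X})$, then:

1. $I(\mathfrak{A}|\mathfrak{Y}_n) \to I(\mathfrak{A}|\mathfrak{Y})$ $\mu$-a.e. and in $L^1$.

2. $H(\mathfrak{A}|\mathfrak{Y}_n) \downarrow H(\mathfrak{A}|\mathfrak{Y})$.

Theorem 2.2.3. (Kolmogorov-Sinai). If $S$ is invertible and $\mathfrak{A} \in \mathcal{P}(\mathfrak{X})$ is such that $\bigvee_{n=-\infty}^{\infty} S^n \sigma(\mathfrak{A}) = \mathfrak{X}$, then $H(S) = H(\mathfrak{A}, S)$.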


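Proof. Let $\mathfrak{A}_n = \bigvee_{k=-n}^{n} S^k \mathfrak{A}$ for $n \ge 1$. Then, since $S$ is invertible and measure preserving,

$$\begin{aligned}
H(\mathfrak{A}_n, S) &= \lim_{p \to \infty} \frac{1}{p} H\Big(\bigvee_{j=0}^{p-1} S^{-j}\mathfrak{A}_n\Big) \\
&= \lim_{p \to \infty} \frac{1}{p} H\Big(\bigvee_{j=0}^{2n+p-1} S^{-j}\mathfrak{A}\Big) \\
&= \lim_{p \to \infty} \frac{2n+p}{p} \cdot \frac{1}{2n+p} H\Big(\bigvee_{j=0}^{2n+p-1} S^{-j}\mathfrak{A}\Big) \\
&= H(\mathfrak{A}, S).
\end{aligned}$$

Now, let $\mathfrak{B} \in \mathcal{P}(\mathfrak{X})$. Then,

$$\begin{aligned}
H(\mathfrak{B}, S) &\le H(\mathfrak{A}_n, S) + H(\mathfrak{B}|\mathfrak{A}_n), \quad \text{by Theorem 2.2.2,} \\
&= H(\mathfrak{A}, S) + H(\mathfrak{B}|\mathfrak{A}_n) \\
&\to H(\mathfrak{A}, S) \quad (n \to \infty),
\end{aligned}$$

since $H(\mathfrak{B}|\sigma(\mathfrak{A}_n)) \downarrow H(\mathfrak{B}|\mathfrak{X})$ by Lemma 2.2.1 (2) and $H(\mathfrak{B}|\mathfrak{X}) = 0$. This means that $H(\mathfrak{B}, S) \le H(\mathfrak{A}, S)$ for every $\mathfrak{B} \in \mathcal{P}(\mathfrak{X})$, which implies that

$$H(\mathfrak{A}, S) = \sup\{H(\mathfrak{B}, S) : \mathfrak{B} \in \mathcal{P}(\mathfrak{X})\} = H(S).$$

The Kolmogorov-Sinai theorem provides us with a way to calculate the entropy $H(S)$ of an invertible transformation $S$ by calculating the entropy $H(\mathfrak{A}, S)$ of that transformation relative to a particular partition $\mathfrak{A}$ of $X$. Moreover, this theorem will help us to compute the entropy of Bernoulli shifts and Markov shifts [Kak99].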


2.3 Bernoulli and Markov Shifts


We also want to study various dynamical systems and their entropies. In particular, we would like to know whether or not these systems are isomorphic.


Definition 2.3.1. Let $(X_i, \mathfrak{X}_i, \mu_i, S_i)$ ($i = 1, 2$) be two dynamical systems. These systems are said to be isomorphic, denoted $S_1 \cong S_2$, if there exists some one-to-one and onto mapping $\varphi : X_1 \to X_2$ such that

i. for any subset $A_1 \subseteq X_1$, $A_1 \in \mathfrak{X}_1$ iff $\varphi(A_1) \in \mathfrak{X}_2$, and $\mu_1(A_1) = \mu_2(\varphi(A_1))$;
ii. $\varphi \circ S_1 = S_2 \circ \varphi$, that is, $\varphi(S_1 x_1) = S_2 \varphi(x_1)$ for $x_1 \in X_1$.

In this case, $\varphi$ is called an isomorphism.

As a matter of fact, if $S_1 \cong S_2$, then $H(S_1) = H(S_2)$. That is, the Kolmogorov-Sinai entropy of measure preserving transformations is invariant under isomorphism. Consequently, if $H(S_1) \ne H(S_2)$, then $S_1 \not\cong S_2$. Next, we define and compute the entropy of Bernoulli and Markov shifts.


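Example 2.3.1. Bernoulli Shifts. Let $(X_0, p)$ be a finite scheme, where $X_0 = \{a_1, \ldots, a_l\}$ and $p = (p_1, \ldots, p_l) \in \Delta_l$, so that $p(a_j) = p_j$, $1 \le j \le l$. Consider the infinite Cartesian product

$$X = X_0^{\mathbb{Z}} = \{x = (\ldots, x_{-1}, x_0, x_1, \ldots) : x_k \in X_0,\ k \in \mathbb{Z}\},$$

where $\mathbb{Z} = \{0, \pm 1, \pm 2, \ldots\}$, and the shift $S$ on $X$ given by

$$S(\ldots, x_{-1}, x_0, x_1, \ldots) = (\ldots, x'_{-1}, x'_0, x'_1, \ldots), \quad \text{where } x'_k = x_{k+1},\ k \in \mathbb{Z}.$$

A cylinder set is defined by

$$[x_i^0 \cdots x_j^0] = \{x = (\ldots, x_{-1}, x_0, x_1, \ldots) : x_k = x_k^0,\ i \le k \le j\}$$

and let

$$\mu_0([x_i^0 \cdots x_j^0]) = p(x_i^0) \cdots p(x_j^0).$$

Extend $\mu_0$ to the $\sigma$-algebra $\mathfrak{X}$ generated by all cylinder sets, and denote the extension by $\mu$. Note that $S$ is measure-preserving w.r.t. $\mu$ and hence $(X, \mathfrak{X}, \mu, S)$ is a dynamical system. The shift $S$ is called a $(p_1, \ldots, p_l)$-Bernoulli shift. Since $\mathfrak{A} = \{[x_0 = a_1], \ldots, [x_0 = a_l]\}$ is a finite partition of $X$ and $\bigvee_{n=-\infty}^{\infty} S^n \sigma(\mathfrak{A}) = \mathfrak{X}$ by definition, we have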


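$$H(S) = H(\mathfrak{A}, S) = \lim_{n \to \infty} \frac{1}{n} H\Big(\bigvee_{k=0}^{n-1} S^{-k}\mathfrak{A}\Big).$$

Now $\bigvee_{k=0}^{n-1} S^{-k}\mathfrak{A} = \{[x_0 \cdots x_{n-1}] : x_j \in X_0,\ 0 \le j \le n-1\}$ and hence

$$\begin{aligned}
H\Big(\bigvee_{k=0}^{n-1} S^{-k}\mathfrak{A}\Big) &= -\sum_{x_0, \ldots, x_{n-1} \in X_0} \mu([x_0 \cdots x_{n-1}]) \log \mu([x_0 \cdots x_{n-1}]) \\
&= -\sum_{x_0, \ldots, x_{n-1} \in X_0} \mu([x_0 \cdots x_{n-1}]) \log \mu([x_0]) \cdots \mu([x_{n-1}]) \\
&= -\sum_{x_0 \in X_0} \mu([x_0]) \log \mu([x_0]) - \cdots - \sum_{x_{n-1} \in X_0} \mu([x_{n-1}]) \log \mu([x_{n-1}]) \\
&= n H(\mathfrak{A})
\end{aligned}$$

since $\mu([a_j]) = p(a_j) = p_j$ for $1 \le j \le l$. This implies that

$$H(S) = H(\mathfrak{A}) = -\sum_{j=1}^{l} p_j \log p_j.$$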
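The key step of the Bernoulli shift computation, that the partition into cylinders of length $n$ has entropy $n$ times the entropy of the one-symbol partition, can be checked by brute force for small $n$. The distribution `p` below is a made-up example, and `block_entropy` is an illustrative name.

```python
import itertools
import math

def entropy(probs) -> float:
    """Shannon entropy in bits, with 0 log 0 = 0."""
    return -sum(x * math.log2(x) for x in probs if x > 0)

def block_entropy(p, n: int) -> float:
    """Entropy of the partition into cylinders [x_0 ... x_{n-1}] under the product measure."""
    cylinders = [math.prod(word) for word in itertools.product(p, repeat=n)]
    return entropy(cylinders)

p = [0.5, 0.25, 0.25]  # hypothetical symbol probabilities
rates = [block_entropy(p, n) / n for n in range(1, 7)]
print(rates, entropy(p))
```

Every entry of `rates` equals the single-symbol entropy, so the limit defining the entropy of the shift is reached already at $n = 1$.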


A simple geometric representation of Bernoulli shifts is given by the Baker's Transformation, which is an area-preserving transformation of the unit square onto itself. Figure 2.1 illustrates how to construct a $\big(\frac{1}{2}, \frac{1}{2}\big)$-Bernoulli shift.

Step 1. Cut the unit square into two columns $A$ and $B$ of equal width.
Step 2. Squeeze each column to a rectangle of height 1/2 and base 1, giving $A'$ and $B'$.
Step 3. Put $B'$ on top of $A'$ to form a square.

Figure 2.1: $p = \big(\frac{1}{2}, \frac{1}{2}\big)$-Bernoulli shift


In the same manner, we can construct a Bernoulli shift on four symbols using the
Baker's Transformation. If we compute the entropies of a $\big(\frac{1}{2}, \frac{1}{2}\big)$-Bernoulli shift and a $\big(\frac{1}{3}, \frac{1}{3}, \frac{1}{3}\big)$-Bernoulli shift, we will notice that the entropies are not the same; consequently, these Bernoulli shifts are not isomorphic. In fact, two Bernoulli shifts with the same entropy are isomorphic, and this was proved by Ornstein in 1970.


Example 2.3.2. Markov Shifts. Now, consider a finite scheme $(X_0, p)$ and the infinite product space $X = X_0^{\mathbb{Z}}$ with the shift $S$ of Example 2.3.1. Let $M = [m_{ij}]$ be an $l \times l$ stochastic matrix, i.e., $m_{ij} \ge 0$, $\sum_{j=1}^{l} m_{ij} = 1$ for $1 \le i, j \le l$, and $m = (m_1, \ldots, m_l)$ be a probability distribution such that $\sum_{i=1}^{l} m_i m_{ij} = m_j$ for $1 \le j \le l$. Each $m_{ij}$ indicates the transition probability from the state $a_i$ to the state $a_j$ and the row vector $m$ is fixed by $M$ in the sense that $mM = m$. We always assume that $m_i > 0$ for every $i = 1, \ldots, l$. Now we define $\mu_0$ on $\mathfrak{M}$, the set of all cylinder sets, by

$$\mu_0([a_{i_0} \cdots a_{i_n}]) = m_{i_0} m_{i_0 i_1} \cdots m_{i_{n-1} i_n}.$$

$\mu_0$ is uniquely extended to a measure $\mu$ on $\mathfrak{X}$ which is $S$-invariant. The shift $S$ is called an $(M, m)$-Markov shift.


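To compute the entropy of an $(M, m)$-Markov shift $S$, consider the partition $\mathfrak{A} = \{[x_0 = a_1], \ldots, [x_0 = a_l]\} \in \mathcal{P}(\mathfrak{X})$, which satisfies $\bigvee_{n=-\infty}^{\infty} S^n \sigma(\mathfrak{A}) = \mathfrak{X}$. As in the previous example,

$$\begin{aligned}
H\Big(\bigvee_{k=0}^{n-1} S^{-k}\mathfrak{A}\Big) &= -\sum_{x_0, \ldots, x_{n-1} \in X_0} \mu([x_0 \cdots x_{n-1}]) \log \mu([x_0 \cdots x_{n-1}]) \\
&= -\sum_{i_0, \ldots, i_{n-1} = 1}^{l} m_{i_0} m_{i_0 i_1} \cdots m_{i_{n-2} i_{n-1}} \log m_{i_0} m_{i_0 i_1} \cdots m_{i_{n-2} i_{n-1}} \\
&= -\sum_{i_0 = 1}^{l} m_{i_0} \log m_{i_0} - (n - 1) \sum_{i,j=1}^{l} m_i m_{ij} \log m_{ij},
\end{aligned}$$

since $\sum_{i=1}^{l} m_i m_{ij} = m_j$ and $\sum_{j=1}^{l} m_{ij} = 1$ for $1 \le i, j \le l$. Dividing by $n$ and letting $n \to \infty$, we get

$$H(S) = -\sum_{i,j=1}^{l} m_i m_{ij} \log m_{ij}.$$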
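The Markov shift entropy $H(S) = -\sum_{i,j} m_i m_{ij} \log m_{ij}$ is straightforward to evaluate once the invariant vector $m$ is known. The sketch below finds $m$ by power iteration, which is an assumption on my part (it converges for irreducible aperiodic matrices; the text only assumes that such an $m$ exists), and uses a made-up $2 \times 2$ matrix `M`.

```python
import math

def stationary(M, iters: int = 10_000):
    """Approximate the row vector m with mM = m by repeated multiplication."""
    l = len(M)
    m = [1 / l] * l
    for _ in range(iters):
        m = [sum(m[i] * M[i][j] for i in range(l)) for j in range(l)]
    return m

def markov_entropy(M, m) -> float:
    """H(S) = -sum_{i,j} m_i m_ij log2 m_ij, skipping zero transitions."""
    l = len(M)
    return -sum(m[i] * M[i][j] * math.log2(M[i][j])
                for i in range(l) for j in range(l) if M[i][j] > 0)

M = [[0.9, 0.1],
     [0.5, 0.5]]       # hypothetical stochastic matrix
m = stationary(M)      # invariant distribution, here (5/6, 1/6)
print(m, markov_entropy(M, m))
```

When every row of `M` equals the same distribution `p`, the formula reduces to the Bernoulli shift entropy of `p`, which is a useful consistency check.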
Chapter 3


Relative Entropy and 
Kullback-Leibler Information 


3.1 Introduction 


In Chapter 1, we defined mutual information as a measure of the amount of information one random variable contains about another. We also observed that the self-information of a random variable is the entropy of that random variable. A more general notion than mutual information is the relative entropy. The relative entropy is a measure of the distance between two probability distributions. Here we will define the relative entropy $H(p|q)$ for two finite probability distributions $p$ and $q$ and provide certain properties and an application in the field of statistics. In addition, we define the relative entropy for two distributions $p$ and $q$ of a continuous random variable and extend the concept to an arbitrary pair of probability measures.


3.2 Discrete Relative Entropy and Its Properties 


Definition 3.2.1. Let $p, q \in \Delta_n$. The relative entropy $H(p|q)$ of $p$ w.r.t. (with respect to) $q$ is given by

$$H(p|q) = \sum_{j=1}^{n} p_j \log \frac{p_j}{q_j}.$$

As before, we use the convention that $0 \log \frac{0}{0} = 0$ and $0 \log \frac{0}{q} = 0$. But if $p_j > 0$ and $q_j = 0$ for some $j$, then we define $p_j \log \frac{p_j}{0} = \infty$ and $H(p|q) = \infty$.

Although the relative entropy is thought of as a distance between two probability distributions, it is not a metric because it satisfies neither the triangle inequality nor the symmetry property of a metric. Notwithstanding, the relative entropy $H(p|q)$ is a measure of the inefficiency of assuming that the distribution is $q$ when the true distribution is $p$ [Kul97]. The next example illustrates that the relative entropy is not symmetric.


Example 3.2.1. Let the random variable $X = \{0, 1\}$ and suppose that $p$ and $q$ are two probability distributions of $X$. Let $p = (1 - p, p)$, and let $q = (1 - q, q)$. Now, we have

$$H_2(p|q) = (1 - p) \log \frac{1 - p}{1 - q} + p \log \frac{p}{q}$$

and

$$H_2(q|p) = (1 - q) \log \frac{1 - q}{1 - p} + q \log \frac{q}{p}.$$

Notice if $p = q$, then $H_2(p|q) = H_2(q|p) = 0$. Now suppose that $p = \frac{1}{4}$ and $q = \frac{1}{8}$. Then

$$H_2(p|q) = \frac{3}{4} \log \frac{3/4}{7/8} + \frac{1}{4} \log \frac{1/4}{1/8} = 0.0832 \text{ bit},$$

$$H_2(q|p) = \frac{7}{8} \log \frac{7/8}{3/4} + \frac{1}{8} \log \frac{1/8}{1/4} = 0.0696 \text{ bit}.$$

Thus, $H_2(p|q) \ne H_2(q|p)$.
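The asymmetry of the relative entropy is easy to reproduce numerically. The sketch below takes $p = 1/4$ and $q = 1/8$, values chosen because they are consistent with the 0.0832 bit figure in the example; the function name `relative_entropy` is illustrative.

```python
import math

def relative_entropy(p, q) -> float:
    """H(p|q) = sum_j p_j log2(p_j / q_j), with 0 log(0/q) = 0 and p log(p/0) = infinity."""
    total = 0.0
    for pj, qj in zip(p, q):
        if pj == 0:
            continue
        if qj == 0:
            return math.inf
        total += pj * math.log2(pj / qj)
    return total

p = [3 / 4, 1 / 4]  # (1 - p, p) with p = 1/4
q = [7 / 8, 1 / 8]  # (1 - q, q) with q = 1/8
print(round(relative_entropy(p, q), 4), round(relative_entropy(q, p), 4))
```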


In the next definition, we let $p(x, y)$ and $q(x, y)$ be two joint probability distributions for the pair of random variables $(X, Y)$ and $p(x)$ be the probability distribution for $X$.

Definition 3.2.2. Given the joint probabilities $p(x, y)$ and $q(x, y)$, the conditional relative entropy is defined as

$$H(p(y|x)|q(y|x)) = \sum_{x} p(x) \sum_{y} p(y|x) \log \frac{p(y|x)}{q(y|x)}.$$

The conditional relative entropy is the average of the relative entropies between the conditional probability distributions $p(y|x)$ and $q(y|x)$, averaged over the probability distribution $p(x)$. Now, we express the relative entropy between two joint probability distributions on $(X, Y)$ as the sum of a relative entropy and a conditional relative entropy.
Theorem 3.2.1. The relative entropy between two joint probability distributions $p(x, y)$ and $q(x, y)$ on $(X, Y)$ is given by

$$H(p(x, y)|q(x, y)) = H(p(x)|q(x)) + H(p(y|x)|q(y|x)).$$

Proof. Observe that

$$\begin{aligned}
H(p(x, y)|q(x, y)) &= \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{q(x, y)} \\
&= \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x) p(y|x)}{q(x) q(y|x)} \\
&= \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x)}{q(x)} + \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(y|x)}{q(y|x)} \\
&= H(p(x)|q(x)) + H(p(y|x)|q(y|x)).
\end{aligned}$$
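The chain rule of Theorem 3.2.1 can be checked on a small example. The joint tables `P` and `Q` below are made up for illustration, with `P[x][y]` standing for $p(x, y)$; all entries are positive so that every conditional distribution is defined.

```python
import math

def kl(p, q) -> float:
    """Relative entropy sum_j p_j log2(p_j / q_j) for strictly positive q."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

P = [[0.10, 0.30],
     [0.20, 0.40]]  # hypothetical joint distribution p(x, y)
Q = [[0.25, 0.25],
     [0.25, 0.25]]  # hypothetical joint distribution q(x, y)

def chain_rule_sides(P, Q):
    """Return (H(p(x,y)|q(x,y)), H(p(x)|q(x)) + H(p(y|x)|q(y|x)))."""
    joint = kl([v for row in P for v in row], [v for row in Q for v in row])
    px = [sum(row) for row in P]
    qx = [sum(row) for row in Q]
    marginal_term = kl(px, qx)
    conditional_term = sum(
        px[x] * kl([v / px[x] for v in P[x]], [v / qx[x] for v in Q[x]])
        for x in range(len(P))
    )
    return joint, marginal_term + conditional_term

lhs, rhs = chain_rule_sides(P, Q)
print(lhs, rhs)
```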


To prove some fundamental properties in information theory, we have used the concept of convexity. In this chapter, we define convexity and state Jensen's inequality. Then, we use these concepts to show some properties of the relative entropy.


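Definition 3.2.3. A function $\varphi : (a,b) \to \mathbb{R}$ is convex if

$$\varphi\left(\sum_{i=1}^{n} \lambda_i x_i\right) \le \sum_{i=1}^{n} \lambda_i \varphi(x_i)$$

for all $x_i \in (a,b)$ and $\lambda_i \in [0,1]$ with $\sum_{i=1}^{n} \lambda_i = 1$. Equality holds when $x_i = x$ for some $x \in (a,b)$ and all $i$ with $\lambda_i > 0$. We also say that $\varphi$ is strictly convex if

$$\varphi\left(\sum_{i=1}^{n} \lambda_i x_i\right) < \sum_{i=1}^{n} \lambda_i \varphi(x_i)$$

whenever the points $x_i$ with $\lambda_i > 0$ are not all equal.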


Definition 3.2.4. A function $f$ is concave if $-f$ is convex. Geometrically, a function is convex if it always lies below any chord, and concave if it always lies above any chord.

In particular, a function $f$ is convex on an interval if it has a second derivative that is nonnegative over that interval [Yeh06].
Lemma 3.2.1. (Jensen's Inequality) Let $\varphi : (a,b) \to \mathbb{R}$ be a convex function and let $f : X \to (a,b)$ be a measurable function in $L^1$ on a probability space $(X,\mathfrak{X},\mu)$. Then

$$\varphi\left(\int_X f(x)\,d\mu(x)\right) \le \int_X \varphi(f(x))\,d\mu(x),$$

and if $\varphi$ is strictly convex, then

$$\varphi\left(\int_X f(x)\,d\mu(x)\right) < \int_X \varphi(f(x))\,d\mu(x)$$

unless $f(x) = t$ for $\mu$-almost every $x \in X$ for some fixed $t \in (a,b)$.
In particular, if $f$ is a convex function and $X$ is a random variable, then

$$E f(X) \ge f(EX).$$

Equality holds whenever $X = EX$ with probability 1, that is, $X$ is constant.


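Lemma 3.2.2. (Log-Sum Inequality) Let $p_i, q_i > 0$ for $1 \le i \le n$. Then

$$\sum_{i=1}^{n} p_i \log\frac{p_i}{q_i} \ge \left(\sum_{i=1}^{n} p_i\right)\log\frac{\sum_{i=1}^{n} p_i}{\sum_{i=1}^{n} q_i},$$

where equality holds if and only if $\frac{p_i}{q_i}$ = constant.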


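Proof. First we claim that the function $f(t) = t\log t$ is strictly convex. It is sufficient to show that $f'' > 0$. Indeed, if we differentiate twice, $f''(t) = \frac{1}{t} > 0$ when $t > 0$. By Jensen's inequality, we also know that $\sum_{i=1}^{n} \alpha_i f(t_i) \ge f\left(\sum_{i=1}^{n} \alpha_i t_i\right)$ with $\alpha_i \ge 0$, $\alpha_1 + \cdots + \alpha_n = 1$. Now, let $p_i, q_i > 0$ for $1 \le i \le n$. Taking $\alpha_i = q_i / \sum_{j=1}^{n} q_j$ and $t_i = p_i/q_i$, we have

$$\begin{aligned}
\sum_{i=1}^{n} p_i \log\frac{p_i}{q_i} &= \sum_{i=1}^{n} q_i\, f\!\left(\frac{p_i}{q_i}\right) \\
&= \left(\sum_{j=1}^{n} q_j\right) \sum_{i=1}^{n} \frac{q_i}{\sum_{j=1}^{n} q_j}\, f\!\left(\frac{p_i}{q_i}\right) \\
&\ge \left(\sum_{j=1}^{n} q_j\right) f\!\left(\sum_{i=1}^{n} \frac{q_i}{\sum_{j=1}^{n} q_j}\cdot\frac{p_i}{q_i}\right) \\
&= \left(\sum_{i=1}^{n} p_i\right) \log\frac{\sum_{i=1}^{n} p_i}{\sum_{i=1}^{n} q_i}.
\end{aligned}$$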
We next establish some important properties of the relative entropy by using Lemma 3.2.2.


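Theorem 3.2.2. Let $p$ and $q$ be two probability distributions of the random variable $X$. Then we have

i. Nonnegativity. $H(p|q) \ge 0$, and $H(p|q) = 0$ if and only if $p = q$.

ii. Convexity. Let $p_1, p_2, q_1, q_2$ be probability distributions of the random variable $X$. Then, for $\alpha \in [0,1]$, we have

$$H(\alpha p_1 + (1-\alpha)p_2 \,|\, \alpha q_1 + (1-\alpha)q_2) \le \alpha H(p_1|q_1) + (1-\alpha)H(p_2|q_2).$$

iii. Partition Inequality. Let $\mathfrak{A} = \{A_1,\ldots,A_k\}$ be a partition of $X$, that is, $\bigcup_{i=1}^{k} A_i = X$ and $A_i \cap A_j = \emptyset$ whenever $i \ne j$. Define

$$p_{\mathfrak{A}}(i) = \sum_{x\in A_i} p(x), \qquad q_{\mathfrak{A}}(i) = \sum_{x\in A_i} q(x), \qquad 1 \le i \le k;$$

then

$$H(p|q) \ge H(p_{\mathfrak{A}}|q_{\mathfrak{A}}),$$

where equality holds if and only if, for each $i$, the ratio $p(x)/q(x)$ is constant for $x \in A_i$.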


Proof.
(i.) By definition, we have

$$\begin{aligned}
H(p|q) &= \sum_{x\in X} p(x)\log\frac{p(x)}{q(x)} \\
&\ge \left(\sum_{x\in X} p(x)\right)\log\frac{\sum_{x\in X} p(x)}{\sum_{x\in X} q(x)}, \quad \text{by Lemma 3.2.2,} \\
&= 1\log 1 = 0,
\end{aligned}$$

and it is clear that $H(p|q) = 0$ if and only if $p(x) = q(x)$ for all $x \in X$.
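(ii.) Let $p_1, p_2, q_1, q_2$ be probability distributions of the random variable $X$ and $\alpha \in [0,1]$. Then

$$\begin{aligned}
H(\alpha p_1 &+ (1-\alpha)p_2 \,|\, \alpha q_1 + (1-\alpha)q_2) \\
&= \sum_{x\in X} \left(\alpha p_1(x) + (1-\alpha)p_2(x)\right)\log\frac{\alpha p_1(x) + (1-\alpha)p_2(x)}{\alpha q_1(x) + (1-\alpha)q_2(x)} \\
&\le \sum_{x\in X} \alpha p_1(x)\log\frac{\alpha p_1(x)}{\alpha q_1(x)} + \sum_{x\in X} (1-\alpha)p_2(x)\log\frac{(1-\alpha)p_2(x)}{(1-\alpha)q_2(x)}, \quad \text{by Lemma 3.2.2,} \\
&= \alpha\sum_{x\in X} p_1(x)\log\frac{p_1(x)}{q_1(x)} + (1-\alpha)\sum_{x\in X} p_2(x)\log\frac{p_2(x)}{q_2(x)} \\
&= \alpha H(p_1|q_1) + (1-\alpha)H(p_2|q_2).
\end{aligned}$$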


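(iii.) Let $\mathfrak{A} = \{A_1,\ldots,A_k\}$ be a partition of $X$. Then

$$\begin{aligned}
H(p|q) &= \sum_{x\in X} p(x)\log\frac{p(x)}{q(x)} \\
&= \sum_{i=1}^{k}\sum_{x\in A_i} p(x)\log\frac{p(x)}{q(x)} \\
&\ge \sum_{i=1}^{k}\left(\sum_{x\in A_i} p(x)\right)\log\frac{\sum_{x\in A_i} p(x)}{\sum_{x\in A_i} q(x)}, \quad \text{by Lemma 3.2.2,} \\
&= \sum_{i=1}^{k} p_{\mathfrak{A}}(i)\log\frac{p_{\mathfrak{A}}(i)}{q_{\mathfrak{A}}(i)}, \quad \text{by hypothesis,} \\
&= H(p_{\mathfrak{A}}|q_{\mathfrak{A}}).
\end{aligned}$$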
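The three properties of Theorem 3.2.2 can be illustrated numerically. The sketch below uses hypothetical distributions on a four-point set and a hypothetical partition into two blocks; all names and numbers are assumptions made for illustration, and natural logarithms are used.

```python
import math

def relative_entropy(p, q):
    """H(p|q) for finite distributions given as lists."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def mix(alpha, u, v):
    """Convex combination alpha * u + (1 - alpha) * v, taken pointwise."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(u, v)]

def coarsen(p, blocks):
    """Distribution induced on a partition: the mass of each block."""
    return [sum(p[i] for i in block) for block in blocks]

# Hypothetical distributions on a four-point set.
p1, q1 = [0.1, 0.2, 0.3, 0.4], [0.25, 0.25, 0.25, 0.25]
p2, q2 = [0.4, 0.3, 0.2, 0.1], [0.1, 0.1, 0.4, 0.4]
alpha = 0.3
blocks = [[0, 1], [2, 3]]  # partition {A_1, A_2} written as index sets

# (i) nonnegativity, (ii) convexity gap, (iii) partition gap.
h = relative_entropy(p1, q1)
convex_gap = (alpha * relative_entropy(p1, q1) + (1 - alpha) * relative_entropy(p2, q2)
              - relative_entropy(mix(alpha, p1, p2), mix(alpha, q1, q2)))
partition_gap = h - relative_entropy(coarsen(p1, blocks), coarsen(q1, blocks))
print(h >= 0, convex_gap >= 0, partition_gap >= 0)
```

Each gap is the difference between the larger and the smaller side of the corresponding inequality, so all three checks should come out nonnegative.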


One application of the relative entropy is a statistical hypothesis testing problem [Kul97]. In the simplest case, let

$$H_0 : p = (p(a_1),\ldots,p(a_n)),$$
$$H_1 : q = (q(a_1),\ldots,q(a_n)),$$

and we have to decide which one is true on the basis of samples of size $k$, where $p, q \in \Delta_n$ and $X_0 = \{a_1,\ldots,a_n\}$. We need to find a set $A \subseteq X_0^k$ so that, for a sample $(x_1,\ldots,x_k) \in A$, $H_0$ is rejected and $H_1$ is accepted; otherwise $H_0$ is accepted. Here, for some $\varepsilon \in (0,1)$, the type 1 error probability $P(A)$ satisfies

$$\sum_{(x_1,\ldots,x_k)\in A}\ \prod_{j=1}^{k} p(x_j) = P(A) \le \varepsilon$$

and the type 2 error probability $Q(A^c)$ is given by

$$\sum_{(x_1,\ldots,x_k)\notin A}\ \prod_{j=1}^{k} q(x_j) = 1 - Q(A) = Q(A^c),$$

which should be minimized. Thus, for any $\varepsilon \in (0,1)$ and sample size $k \ge 1$, let

$$\beta(k,\varepsilon) = \min\{Q(A^c) : P(A^c) \ge 1-\varepsilon,\ A \subseteq X_0^k\}.$$

Then we claim that

$$\lim_{k\to\infty}\frac{1}{k}\log\beta(k,\varepsilon) = \sum_{j=1}^{n} p(a_j)\log\frac{q(a_j)}{p(a_j)} = -H(p|q).$$

It follows from the claim that, subject to the constraint on $P(A^c)$, the smallest achievable type 2 error probability $Q(A^c)$ decays exponentially in the sample size $k$, with rate given by the relative entropy $H(p|q)$ of $p$ with respect to $q$. A proof of this claim is found in [Kak99]. Next, we define the relative entropy for continuous random variables.


3.3 Continuous Entropy and Relative Entropy 


Definition 3.3.1. Let $X$ be a random variable with cumulative distribution function $F(x) = P(X \le x)$. If $F(x)$ is continuous, the random variable is said to be continuous. Let $f(x) = F'(x)$ when the derivative is defined. If

$$\int_{-\infty}^{\infty} f(x)\,dx = 1,$$

$f(x)$ is called the probability density function for $X$. The set where $f(x) > 0$ is called the support set of $X$. Now, we define the expected value of $X$ by

$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx.$$


Definition 3.3.2. The entropy $H(X)$ of a continuous random variable $X$ with density $f(x)$ is defined as

$$H(X) = -\int_{S} f(x)\log f(x)\,dx,$$

where $S$ is the support set of the random variable, provided that the integral exists.


Now, we define the relative entropy for continuous random variables and show some of its properties.

Definition 3.3.3. The relative entropy $H(f|g)$ between two densities $f$ and $g$ is defined by

$$H(f|g) = \int f\log\frac{f}{g}.$$

Note that $H(f|g)$ is finite only if the support set of $f$ is contained in the support set of $g$.


In Chapter 1, the discrete entropy is maximized when the events are equally likely, that is, uniformly distributed. In the continuous case, the result is similar, as shown in the next example.

Example 3.3.1. (Uniform Distribution) Let $f$ be the probability density function on $(a,b)$ given by

$$f(x) = \begin{cases} \dfrac{1}{b-a}, & a < x < b, \\ 0, & \text{otherwise.} \end{cases}$$

Let us find the entropy of a random variable $X$ with the uniform density function $f$. Then,

$$H(X) = -\int_a^b \frac{1}{b-a}\log\frac{1}{b-a}\,dx = \log(b-a).$$
Now, assume that $f(x)$ is a probability density function on $(a,b)$ and $g(x)$ is the uniform density function on $(a,b)$. Then,

$$\begin{aligned}
H(f|g) &= \int_{(a,b)} f(x)\log\frac{f(x)}{g(x)}\,dx \\
&= \int_{(a,b)} f(x)\left(\log f(x) - \log g(x)\right)dx \\
&= \int_{(a,b)} f(x)\log f(x)\,dx + \log(b-a)\int_{(a,b)} f(x)\,dx \\
&= -H(X) + \log(b-a) \ge 0.
\end{aligned}$$


First we notice that, as in the discrete case, the relative entropy satisfies $H(f|g) \ge 0$. Also, this example provides an upper bound for the entropy of any random variable $X$ with a probability density function on the interval $(a,b)$, i.e., $H(X) \le \log(b-a)$.
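The bound $H(X) \le \log(b-a)$ can be illustrated with a simple numerical integration. The sketch below is only illustrative: it assumes the interval $(1,4)$, a symmetric triangular density as the non-uniform example, natural logarithms, and a midpoint rule; none of these choices come from the original text.

```python
import math

def differential_entropy(f, a, b, n=100_000):
    """Midpoint-rule approximation of -integral of f log f over (a, b)."""
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        fx = f(a + (i + 0.5) * h)
        if fx > 0:
            total -= fx * math.log(fx) * h
    return total

a, b = 1.0, 4.0

def uniform(x):
    """Uniform density on (a, b)."""
    return 1.0 / (b - a)

def triangular(x):
    """Symmetric triangular density on (a, b) with peak at the midpoint."""
    mid, half = (a + b) / 2, (b - a) / 2
    return (2.0 / (b - a)) * (1.0 - abs(x - mid) / half)

h_uniform = differential_entropy(uniform, a, b)
h_triangular = differential_entropy(triangular, a, b)
# Relative entropy of the triangular density with respect to the uniform one.
gap = math.log(b - a) - h_triangular
print(h_uniform, h_triangular, gap)
```

The uniform density attains the bound $\log(b-a)$, while the triangular density falls strictly below it, and the difference is exactly the relative entropy $H(f|g)$ computed above.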


We also find the relative entropy between two normal probability density functions and between two exponential probability density functions.


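Example 3.3.2. (Normal Distribution) A random variable $X$ is normally distributed with parameters $\mu$ and $\sigma^2$ if the probability density function $f(x)$ is defined by

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-(x-\mu)^2/2\sigma^2}, \qquad -\infty < x < \infty.$$

Suppose that $f$ and $g$ are normally distributed with parameters $\mu_1, \sigma_1^2$ and $\mu_2, \sigma_2^2$, respectively. Then we have

$$\begin{aligned}
H(f|g) &= \int f(x)\log\frac{f(x)}{g(x)}\,dx \\
&= \int f(x)\log\frac{\frac{1}{\sqrt{2\pi\sigma_1^2}}\,e^{-(x-\mu_1)^2/2\sigma_1^2}}{\frac{1}{\sqrt{2\pi\sigma_2^2}}\,e^{-(x-\mu_2)^2/2\sigma_2^2}}\,dx \\
&= \int f(x)\log\frac{\sigma_2}{\sigma_1}\,dx + \int f(x)\left[-\frac{(x-\mu_1)^2}{2\sigma_1^2} + \frac{(x-\mu_2)^2}{2\sigma_2^2}\right]dx \\
&= \frac{1}{2}\log\frac{\sigma_2^2}{\sigma_1^2} - \frac{1}{2\sigma_1^2}\mathrm{Var}(X) + \frac{1}{2\sigma_2^2}\int f(x)(x-\mu_1+\mu_1-\mu_2)^2\,dx \\
&= \frac{1}{2}\log\frac{\sigma_2^2}{\sigma_1^2} - \frac{1}{2} + \frac{1}{2\sigma_2^2}\int f(x)(x-\mu_1)^2\,dx \\
&\qquad + \frac{1}{2\sigma_2^2}\,2(\mu_1-\mu_2)\int f(x)(x-\mu_1)\,dx + \frac{1}{2\sigma_2^2}(\mu_1-\mu_2)^2\int f(x)\,dx \\
&= \frac{1}{2}\log\frac{\sigma_2^2}{\sigma_1^2} - \frac{1}{2} + \frac{1}{2\sigma_2^2}\mathrm{Var}(X) + \frac{1}{2\sigma_2^2}\,2(\mu_1-\mu_2)(E(X)-\mu_1) + \frac{1}{2\sigma_2^2}(\mu_1-\mu_2)^2 \\
&= \frac{1}{2}\log\frac{\sigma_2^2}{\sigma_1^2} - \frac{1}{2} + \frac{\sigma_1^2}{2\sigma_2^2} + \frac{(\mu_1-\mu_2)^2}{2\sigma_2^2},
\end{aligned}$$

where $E(X) = \mu_1$ and the variance is $\mathrm{Var}(X) = E[(X-\mu_1)^2] = \int f(x)(x-\mu_1)^2\,dx = \sigma_1^2$.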
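As a sanity check on this example, the relative entropy of two normal densities can be compared with a direct numerical integration. The sketch below assumes natural logarithms, uses the simplified value $\frac{1}{2}\left[\log\frac{\sigma_2^2}{\sigma_1^2} - 1 + \frac{\sigma_1^2}{\sigma_2^2} + \frac{(\mu_1-\mu_2)^2}{\sigma_2^2}\right]$ obtained by substituting $E(X) = \mu_1$ and $\mathrm{Var}(X) = \sigma_1^2$, and picks illustrative parameter values; the function names are hypothetical.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Natural logarithm of the normal density with mean mu and std sigma."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def kl_normal_closed(mu1, s1, mu2, s2):
    """Closed-form relative entropy H(f|g) of two normal densities."""
    return 0.5 * (math.log(s2 ** 2 / s1 ** 2) - 1
                  + s1 ** 2 / s2 ** 2 + (mu1 - mu2) ** 2 / s2 ** 2)

def kl_normal_numeric(mu1, s1, mu2, s2, n=200_000, width=12.0):
    """Midpoint-rule approximation of the integral of f log(f/g)."""
    lo, hi = mu1 - width * s1, mu1 + width * s1
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        log_f = normal_logpdf(x, mu1, s1)
        log_g = normal_logpdf(x, mu2, s2)
        total += math.exp(log_f) * (log_f - log_g) * h
    return total

closed = kl_normal_closed(0.0, 1.0, 1.0, 2.0)
numeric = kl_normal_numeric(0.0, 1.0, 1.0, 2.0)
print(closed, numeric)
```

Working with log-densities avoids taking the logarithm of values that underflow to zero in the tails.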


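Example 3.3.3. (Exponential Distribution) A continuous random variable $X$ is said to be exponentially distributed with parameter $\lambda > 0$ if the density function is defined as

$$f(x) = \begin{cases} \lambda e^{-\lambda x}, & \text{if } x \ge 0, \\ 0, & \text{if } x < 0. \end{cases}$$

The entropy $H(X)$ with exponential density function $f$ is calculated as follows:

$$\begin{aligned}
H(X) &= -\int f(x)\log f(x)\,dx \\
&= -\int_0^{\infty} \lambda e^{-\lambda x}\log\left(\lambda e^{-\lambda x}\right)dx \\
&= -\int_0^{\infty} \lambda e^{-\lambda x}\left[\log\lambda + \log e^{-\lambda x}\right]dx \\
&= -\int_0^{\infty} \lambda e^{-\lambda x}\log\lambda\,dx - \int_0^{\infty} \lambda e^{-\lambda x}(-\lambda x)\,dx \\
&= -\log\lambda\int_0^{\infty} \lambda e^{-\lambda x}\,dx + \lambda\int_0^{\infty} x\lambda e^{-\lambda x}\,dx \\
&= -\log\lambda\cdot 1 + \lambda\cdot E(X) \\
&= -\log\lambda + 1.
\end{aligned}$$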


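We now find the relative entropy of two exponential density functions, $f$ with parameter $\lambda_1$ and $g$ with parameter $\lambda_2$.

$$\begin{aligned}
H(f|g) &= \int f(x)\log\frac{f(x)}{g(x)}\,dx \\
&= \int f(x)\left[\log f(x) - \log g(x)\right]dx \\
&= \int f(x)\log f(x)\,dx - \int f(x)\log g(x)\,dx \\
&= -(-\log\lambda_1 + 1) - \int_0^{\infty} \lambda_1 e^{-\lambda_1 x}\left[\log\lambda_2 + \log e^{-\lambda_2 x}\right]dx \\
&= -(-\log\lambda_1 + 1) - \log\lambda_2\int_0^{\infty} \lambda_1 e^{-\lambda_1 x}\,dx + \lambda_2\int_0^{\infty} x\lambda_1 e^{-\lambda_1 x}\,dx \\
&= \log\lambda_1 - 1 - \log\lambda_2\cdot 1 + \lambda_2\int_0^{\infty} x\lambda_1 e^{-\lambda_1 x}\,dx \\
&= \log\lambda_1 - 1 - \log\lambda_2 + \lambda_2\cdot E(X) \\
&= \log\frac{\lambda_1}{\lambda_2} + \frac{\lambda_2}{\lambda_1} - 1,
\end{aligned}$$

where $E(X) = \int_0^{\infty} x\lambda_1 e^{-\lambda_1 x}\,dx = \frac{1}{\lambda_1}$.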
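The exponential case can be checked in the same spirit. The following sketch assumes natural logarithms, substitutes $E(X) = 1/\lambda_1$ to obtain the value $\log\frac{\lambda_1}{\lambda_2} + \frac{\lambda_2}{\lambda_1} - 1$, and truncates the integral at $50/\lambda_1$, beyond which the integrand is negligible; the parameter values and function names are illustrative assumptions.

```python
import math

def kl_exponential_closed(lam1, lam2):
    """Closed-form H(f|g) for exponential densities with rates lam1, lam2."""
    return math.log(lam1 / lam2) + lam2 / lam1 - 1.0

def kl_exponential_numeric(lam1, lam2, n=200_000):
    """Midpoint-rule approximation of the integral of f log(f/g) on [0, 50/lam1]."""
    upper = 50.0 / lam1
    h = upper / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        f = lam1 * math.exp(-lam1 * x)
        # log f(x) - log g(x), written out to avoid log of tiny numbers
        log_ratio = (math.log(lam1) - lam1 * x) - (math.log(lam2) - lam2 * x)
        total += f * log_ratio * h
    return total

closed = kl_exponential_closed(2.0, 0.5)
numeric = kl_exponential_numeric(2.0, 0.5)
print(closed, numeric)
```

The closed form vanishes when $\lambda_1 = \lambda_2$ and is positive otherwise, in agreement with the nonnegativity of the relative entropy.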


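Now, we extend the definition of the relative entropy to a pair of probability measures.

Definition 3.3.4. Let $(X,\mathfrak{X})$ be a measurable space. Let $P(X)$ denote the set of all probability measures on $\mathfrak{X}$ and $\mathcal{P}(\mathfrak{Y})$ denote the set of all finite $\mathfrak{Y}$-measurable partitions of $X$, where $\mathfrak{Y}$ is a $\sigma$-subalgebra of $\mathfrak{X}$. Let $\mu, \nu \in P(X)$. Then, the relative entropy of $\mu$ with respect to $\nu$ relative to $\mathfrak{Y}$ is defined by

$$H_{\mathfrak{Y}}(\mu|\nu) = \sup\left\{\sum_{A\in\mathfrak{A}} \mu(A)\log\frac{\mu(A)}{\nu(A)} : \mathfrak{A} \in \mathcal{P}(\mathfrak{Y})\right\}.$$

If $\mathfrak{Y} = \mathfrak{X}$, then we write $H_{\mathfrak{X}}(\mu|\nu) = H(\mu|\nu)$ and call it the relative entropy of $\mu$ with respect to $\nu$. Note that the relative entropy is defined for any pair of probability measures.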


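If $\mu \ll \nu$, that is, $\mu$ is absolutely continuous with respect to $\nu$, the relative entropy has an integral form. That is,

$$H(\mu|\nu) = \int_X \frac{d\mu}{d\nu}\log\frac{d\mu}{d\nu}\,d\nu = \int_X \log\frac{d\mu}{d\nu}\,d\mu,$$

and if $\mu$ is not absolutely continuous with respect to $\nu$, then $H(\mu|\nu) = \infty$.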


Definition 3.3.5. (Kullback-Leibler information) Let $\xi$ and $\eta$ be real random variables on $(X,\mathfrak{X},\mu)$, so that they have probability distributions $\mu_\xi$ and $\mu_\eta$ given by

$$\mu_\xi(A) = \mu(\xi^{-1}(A)), \qquad \mu_\eta(A) = \mu(\eta^{-1}(A)), \qquad A \in \mathfrak{B},$$

respectively, where $\mathfrak{B}$ is the Borel $\sigma$-algebra of $\mathbb{R}$. Suppose that $\mu_\xi$ and $\mu_\eta$ are absolutely continuous with respect to the Lebesgue measure $dt$ of $\mathbb{R}$, so that we have the probability density functions $f$ and $g$, respectively given by

$$f = \frac{d\mu_\xi}{dt}, \qquad g = \frac{d\mu_\eta}{dt} \in L^1(\mathbb{R}) = L^1(\mathbb{R}, dt).$$

Then the Kullback-Leibler information between $\xi$ and $\eta$ is given by

$$I(\xi|\eta) = \int_{\mathbb{R}} \left(f(t)\log f(t) - f(t)\log g(t)\right)dt.$$

Observe that given the definition above and the integral form of $H(\mu_\xi|\mu_\eta)$, we have

$$\begin{aligned}
I(\xi|\eta) &= \int_{\mathbb{R}} \left(f(t)\log f(t) - f(t)\log g(t)\right)dt \\
&= \int_{\mathbb{R}} \frac{d\mu_\xi}{d\mu_\eta}\log\frac{d\mu_\xi}{d\mu_\eta}\,d\mu_\eta \\
&= H(\mu_\xi|\mu_\eta).
\end{aligned}$$

Therefore, the relative entropy, in general, is interpreted as the Kullback-Leibler information.


3.4 Birkhoff Pointwise Ergodic Theorem 


So far, we have introduced a way of measuring the amount of information by calculating the entropy of a system. Now, we want a way of studying the long-term average behavior of a system that evolves over time. In this section, we state and prove a main theorem found in Abstract Methods in Information Theory [Kak99]. This theorem will help us describe that long-term average behavior. We say that the collection of all states of the system forms a space $X$, and the evolution is represented by a transformation $S : X \to X$. Since we want $S$ to preserve the basic structure on $X$, we take $S$ to be a measure preserving transformation.
Definition 3.4.1. Let $S$ be a measure preserving transformation on a probability space $(X,\mathfrak{X},\mu)$. The map $S$ is said to be ergodic if for every measurable set $A$ satisfying $S^{-1}A = A$, we have $\mu(A) = 0$ or $\mu(A) = 1$.

The next theorem is a generalization of the Strong Law of Large Numbers. That is, if we have a sequence $X_1, X_2, \ldots$ of independent and identically distributed random variables with expectation $E(X_i) = \mu$, then $\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} X_i = \mu$ almost surely. We say that an event occurs almost surely if it occurs with probability one, even if it does not contain all possible outcomes. The outcome or set of outcomes not contained in the event has probability zero [Dur05].


Theorem 3.4.1. (Birkhoff Pointwise Ergodic Theorem) Let $(X,\mathfrak{X},\mu)$ be a probability space, $S : X \to X$ a measure preserving transformation and $f \in L^1(X,\mu)$. Then, there exists a unique $f_S \in L^1(X,\mu)$ such that

$$f_S(x) = \lim_{n\to\infty}\frac{1}{n}\sum_{j=0}^{n-1} f(S^j x)$$

exists a.e., is $S$-invariant and $\int_X f\,d\mu = \int_X f_S\,d\mu$. If moreover $S$ is ergodic, then $f_S$ is a constant a.e. and $f_S = \int_X f\,d\mu$.
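The ergodic case of the theorem can be illustrated with a standard example that is not taken from this text: the rotation $S(x) = x + \alpha \pmod 1$ on $[0,1)$ with Lebesgue measure, which is known to be measure preserving and ergodic when $\alpha$ is irrational. The sketch below assumes $\alpha = \sqrt{2} - 1$ (represented only approximately in floating point), the observable $f(x) = \cos^2(2\pi x)$ with space average $\frac{1}{2}$, and an arbitrary starting point.

```python
import math

def time_average(f, x0, alpha, n):
    """Birkhoff average (1/n) * sum of f(S^j x0) for S(x) = x + alpha mod 1."""
    total = 0.0
    for j in range(n):
        # S^j x0 = x0 + j * alpha (mod 1), computed directly to limit drift
        total += f((x0 + j * alpha) % 1.0)
    return total / n

def observable(x):
    """Integrable function on [0, 1) whose integral equals 1/2."""
    return math.cos(2 * math.pi * x) ** 2

alpha = math.sqrt(2) - 1   # assumed irrational rotation angle
space_average = 0.5        # integral of cos^2(2 pi x) over [0, 1)

short_run = time_average(observable, 0.123, alpha, 100)
long_run = time_average(observable, 0.123, alpha, 100_000)
print(short_run, long_run, space_average)
```

As the number of iterates grows, the time average along a single orbit approaches the space average, which is the content of the statement $f_S = \int_X f\,d\mu$ for ergodic $S$.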


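Proof. Let $f \in L^1(X,\mu)$ and, without loss of generality, consider $f \ge 0$. Define $f_n(x) = f(x) + \cdots + f(S^{n-1}x)$, $\overline{f} = \limsup_{n\to\infty} \frac{f_n}{n}$, and $\underline{f} = \liminf_{n\to\infty} \frac{f_n}{n}$. Then $\overline{f}$ and $\underline{f}$ are $S$-invariant. Observe that

$$\begin{aligned}
\underline{f}(Sx) &= \liminf_{n\to\infty}\frac{f_n(Sx)}{n} \\
&= \liminf_{n\to\infty}\left(\frac{f_{n+1}(x)}{n+1}\cdot\frac{n+1}{n} - \frac{f(x)}{n}\right) \\
&= \liminf_{n\to\infty}\frac{f_{n+1}(x)}{n+1} = \underline{f}(x).
\end{aligned}$$

We show that $\overline{f}$ is $S$-invariant in the same manner. We next show that $f_S$ exists, is integrable and $S$-invariant. It suffices to show that

$$\int_X \overline{f}\,d\mu \le \int_X f\,d\mu \le \int_X \underline{f}\,d\mu.$$

Since $\overline{f} - \underline{f} \ge 0$, this would imply that $\overline{f} = \underline{f} = f_S$ a.e. Let $M > 0$ and $\varepsilon > 0$ be fixed and

$$\overline{f}_M(x) = \min\{\overline{f}(x), M\}, \qquad x \in X.$$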


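Define $n(x)$ to be the least integer $n \ge 1$ such that

$$\overline{f}_M(x) \le \frac{f_n(x)}{n} + \varepsilon = \frac{1}{n}\sum_{j=0}^{n-1} f(S^j x) + \varepsilon, \qquad x \in X.$$

Note that $n(x)$ is finite for each $x \in X$. Since $\overline{f}$ and $\overline{f}_M$ are $S$-invariant, we have

$$\sum_{j=0}^{n(x)-1} \overline{f}_M(S^j x) = n(x)\,\overline{f}_M(x) \le \sum_{j=0}^{n(x)-1} f(S^j x) + n(x)\varepsilon, \qquad x \in X. \tag{3.1}$$

Choose a large enough $N \ge 1$ such that

$$\mu(A) < \frac{\varepsilon}{M} \quad \text{with} \quad A = \{x \in X : n(x) > N\}.$$

Now we define $\tilde{f}$ and $\tilde{n}$ by

$$\tilde{f}(x) = \begin{cases} f(x), & x \notin A, \\ \max\{f(x), M\}, & x \in A, \end{cases} \qquad \tilde{n}(x) = \begin{cases} n(x), & x \notin A, \\ 1, & x \in A. \end{cases}$$

Then we see that for all $x \in X$

$$\tilde{n}(x) \le N, \quad \text{by definition,}$$

$$\sum_{j=0}^{\tilde{n}(x)-1} \overline{f}_M(S^j x) \le \sum_{j=0}^{\tilde{n}(x)-1} \tilde{f}(S^j x) + \tilde{n}(x)\varepsilon, \tag{3.2}$$

by (3.1) and $S$-invariance of $\overline{f}_M$, and that

$$\begin{aligned}
\int_X \tilde{f}\,d\mu &= \int_{A^c} \tilde{f}\,d\mu + \int_A \tilde{f}\,d\mu \\
&\le \int_{A^c} f\,d\mu + \int_A f\,d\mu + \int_A M\,d\mu \\
&= \int_X f\,d\mu + M\mu(A) \le \int_X f\,d\mu + \varepsilon.
\end{aligned} \tag{3.3}$$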
Furthermore, find an integer $L \ge 1$ so that $\frac{NM}{L} < \varepsilon$ and define a sequence $\{n_k(x)\}_{k=0}^{\infty}$ for each $x \in X$ by

$$n_0(x) = 0, \qquad n_k(x) = n_{k-1}(x) + \tilde{n}\left(S^{n_{k-1}(x)}x\right), \qquad k \ge 1.$$

Then it holds that for $x \in X$

$$\sum_{j=0}^{L-1} \overline{f}_M(S^j x) = \sum_{k=1}^{k(x)}\ \sum_{j=n_{k-1}(x)}^{n_k(x)-1} \overline{f}_M(S^j x) + \sum_{j=n_{k(x)}(x)}^{L-1} \overline{f}_M(S^j x),$$

where $k(x)$ is the largest integer $k \ge 1$ such that $n_k(x) \le L-1$. Applying (3.2) to each of the $k(x)$ terms and estimating by $M$ the last $L - n_{k(x)}(x)$ terms, we have

$$\sum_{j=0}^{L-1} \overline{f}_M(S^j x) \le \sum_{j=0}^{L-1} \tilde{f}(S^j x) + L\varepsilon + (N-1)M,$$

since $\tilde{f} \ge 0$, $\overline{f}_M \le M$ and $L - n_{k(x)}(x) \le N-1$. If we integrate both sides on $X$ and divide by $L$, then we get

$$\int_X \overline{f}_M\,d\mu \le \int_X \tilde{f}\,d\mu + \varepsilon + \frac{(N-1)M}{L} \le \int_X f\,d\mu + 3\varepsilon,$$

by the $S$-invariance of $\mu$, (3.3) and $\frac{NM}{L} < \varepsilon$. Thus, letting $\varepsilon \to 0$ and $M \to \infty$ gives the inequality $\int_X \overline{f}\,d\mu \le \int_X f\,d\mu$. The other inequality $\int_X f\,d\mu \le \int_X \underline{f}\,d\mu$ can be obtained similarly. Hence, $\overline{f} = \underline{f} = f_S$ a.e., and $f_S$ is $S$-invariant. If $S$ is ergodic, then $S$-invariance of $f_S$ implies that $f_S$ is a constant a.e. and

$$f_S(x) = \int_X f_S(y)\,d\mu(y) = \int_X f(y)\,d\mu(y).$$




Chapter 4 


Conclusion 


In this thesis, we presented some topics in information theory and developed some examples as applications in different fields of study. As we developed these topics, we saw the importance of measure-theoretic and functional-analytic methods in defining and characterizing some of the properties in information theory.

In Chapter 1, to describe the amount of information of a source, we developed the theory of entropy. Specifically, we defined the Shannon entropy of a finite scheme and established some of its properties. We gave some examples by finding the entropy of the Bernoulli, geometric, and Poisson probability distributions. We also defined the entropy function and provided an inequality that is useful in coding theory. In the next chapter, we defined a dynamical system and a measurable partition in order to define the Kolmogorov-Sinai entropy. Next, we stated and proved the Kolmogorov-Sinai theorem. We defined Bernoulli shifts and showed that Bernoulli shifts with the same entropy are isomorphic. To finish, we calculated the entropy of Markov shifts, again using the Kolmogorov-Sinai theorem.

In the last chapter, we not only defined relative entropy for finite probability distributions and probability density functions, but also extended this definition to an arbitrary pair of probability measures and then defined the Kullback-Leibler information. We proved some essential properties of the relative entropy and provided an application in statistical hypothesis testing. Furthermore, we calculated the relative entropy for the uniform, normal, and exponential distributions. Finally, we briefly developed the concept of ergodicity and proved one of the main results, the Birkhoff Pointwise Ergodic Theorem.